I'm comparing images between thousands of users who can have between 1 and 12 photos. The comparison between two photos returns a score that I need to be stored so I don't make the comparison twice.
What is the best way of storing it?
I thought about storing in a table with one photo per row/column but this can quickly get out of hand
Multi-indexed pandas if you want to work in memory, like so:
import pandas as pd

df = pd.DataFrame(index=[['Alice/foo.png'], ['Bob/bar.png']],
                  columns=['user1', 'user2', 'score'],
                  data=[['Alice', 'Bob', 42.0]])
df.index.names = ['photo1', 'photo2']
df
                          user1 user2  score
photo1        photo2
Alice/foo.png Bob/bar.png Alice   Bob   42.0
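Lookups against the MultiIndex are hash-backed, so checking whether a pair of photos has already been scored is fast. A minimal sketch, assuming the DataFrame above (the paths are sorted so (A, B) and (B, A) resolve to the same key):
pair = tuple(sorted(('Bob/bar.png', 'Alice/foo.png')))
print(pair in df.index)  # True for the row inserted above; O(1) membership test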
SQLite if you want to work on disk, like so:
import sqlite3
# The important part: defining the table
conn = sqlite3.connect('photos.sqlite')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS photos (photoid INTEGER PRIMARY KEY, photopath TEXT UNIQUE, userid INT)')
c.execute('CREATE TABLE IF NOT EXISTS photoscores (photoid1 INTEGER, photoid2 INTEGER, userid1 INT, userid2 INT, score REAL)')
c.execute('CREATE UNIQUE INDEX IF NOT EXISTS photopair ON photoscores (photoid1, photoid2)')
conn.commit()
# Example of populating the table
sql = ("INSERT INTO photos(photopath, userid) VALUES ('foo.png', 1)",
       "INSERT INTO photos(photopath, userid) VALUES ('bar.png', 2)")
for statement in sql:
    c.execute(statement)
conn.commit()
sql = 'SELECT photoid FROM PHOTOS'
c.execute(sql)
values = [v[0] for v in c.fetchall()]
# This is more complicated than it needs to be, especially since
# values will always be sorted if this code is run, but I'm just
# emphasizing the need to keep the photo ids and user ids aligned
sql = ('INSERT OR REPLACE INTO photoscores VALUES (%s, %s, %s, %s, 42.0)'
% (tuple(sorted(values)) + ((1, 2) if values == sorted(values) else (2, 1))))
c.execute(sql)
conn.commit()
import pandas as pd
pd.read_sql('SELECT * FROM photoscores', conn)
   photoid1  photoid2  userid1  userid2  score
0         1         2        1        2   42.0
If you use the SQL code, I'd suggest always sorting the pair of photo IDs you compare.
The key thing is that you want something with an underlying hash map that will quickly tell you whether an existing pair of photos has already been compared.
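In its simplest in-memory form, that hash map can just be a set of sorted ID pairs; a hedged sketch:
seen = set()

def need_compare(photo_id1, photo_id2):
    # Sort so each unordered pair has exactly one key in the set.
    key = tuple(sorted((photo_id1, photo_id2)))
    if key in seen:
        return False
    seen.add(key)
    return True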
cHandler = myDB.cursor()
cHandler.execute('select UserId,C1,LogDate from DeviceLogs_12_2019')  # data from remote sql server database
curs = connection.cursor()
curs.execute("""select * from biometric""")  # data from my database table
lst = []
result = cHandler.fetchall()
for row in result:
    lst.append(row)
lst2 = []
result2 = curs.fetchall()
for row in result2:
    lst2.append(row)
t = []
r = [elem for elem in lst if not elem in lst2]
for i in r:
    print(i)
    t.append(i)
for i in t:
    frappe.db.sql("""Insert into biometric(UserId,C1,LogDate) select '%s','%s','%s' where not exists(select * from biometric where UserID='%s' and LogDate='%s')""",(i[0],i[1],i[2],i[0],i[2]),as_dict=1)
I am trying above code to insert data into my table if record not exists but getting error :
pymysql.err.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '1111'',''in'',''2019-12-03 06:37:15'' where not exists(select * from biometric ' at line 1")
Is there anything I am doing wrong or any other way to achieve this?
It appears you have potentially four problems:
There is a from clause missing between select and where not exists.
When using a prepared statement you do not enclose your placeholder arguments, %s, within quotes; your corrected SQL is shown in the SQL section below.
Your loop:
t = []
r = [elem for elem in lst if not elem in lst2]
for i in r:
    print(i)
    t.append(i)
If you are trying to only include rows from the remote site that will not be duplicates, then you should explicitly check the two fields that matter, i.e. UserId and LogDate. But what is the point, since your SQL already takes care of excluding these duplicate rows? Also, what is the point of copying everything from r to t?
SQL:
Insert into biometric(UserId,C1,LogDate) select %s,%s,%s from DUAL where not exists(select * from biometric where UserID=%s and LogDate=%s)
One note on the above SQL:
If the not exists clause is false, then the select %s,%s,%s from DUAL ... simply returns no rows and nothing is inserted; you do not get an error, the duplicate row is just skipped.
If your concern is getting an error due to duplicate keys because (UserId, LogDate) is either a UNIQUE or PRIMARY KEY, then add the IGNORE keyword to the INSERT statement; if a row with the key already exists, the insertion will be ignored. But there is no way to tell, since you have not provided the table definition:
for i in t:
    frappe.db.sql("Insert IGNORE into biometric(UserId,C1,LogDate) values(%s,%s,%s)",(i[0],i[1],i[2]))
If you do not want multiple rows with the same (UserId, LogDate) combination, then you should define a UNIQUE KEY on these two columns and then the above SQL should be sufficient. There is also an ON DUPLICATE KEY UPDATE ... variation of the INSERT statement where, if the key exists, you can do an update instead (sketched below).
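For reference, a hedged sketch of that variation, assuming a UNIQUE KEY over (UserId, LogDate); overwriting C1 on conflict is just an illustrative choice:
for i in t:
    frappe.db.sql("Insert into biometric(UserId,C1,LogDate) values(%s,%s,%s) "
                  "ON DUPLICATE KEY UPDATE C1 = VALUES(C1)", (i[0], i[1], i[2]))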
If you don't have a UNIQUE KEY defined on these two columns or you need to print out those rows which are being updated, then you do need to test for the presence of the existing keys. But this would be the way to do it:
cHandler = myDB.cursor()
cHandler.execute('select UserId,C1,LogDate from DeviceLogs_12_2019')  # data from remote sql server database
rows = cHandler.fetchall()
curs = connection.cursor()
for row in rows:
    curs.execute("select UserId from biometric where UserId=%s and LogDate=%s", (row[0], row[2]))  # row already in biometric table?
    biometric_row = curs.fetchone()
    if biometric_row is None:  # no, it is not
        print(row)
        frappe.db.sql("Insert into biometric(UserId,C1,LogDate) values(%s, %s, %s)", (row[0], row[1], row[2]))
So I'm trying to work with an SQLite database at the moment and was hoping to get some input on the best way of writing a value to a particular row and column (so a cell) of a database. I know how to write to the database row by row, basically appending a row to the end of the database each time, but instead I would like to write the data into the database non-sequentially.
I have put together an arbitrary example below to illustrate what I'm trying to do, using apples. In this case I create two tables in a database. The first table, apples, is my ID table. It contains a primary key and two columns for the apple name and the farm it is grown on. The second table, keyFeature, contains the primary key again, which refers back to the ID of the apple in the apples table, plus a column each for the taste, texture and colour.
In the example below I have only the taste, texture and colour of the apple Pink Lady from farm 3. I now want to write that information into row 3 of the table keyFeature in the relevant columns before any of the other rows are populated. For the life of me I can't work out how to do this. I assume I need to position the cursor at the correct row, but from the documentation I am not clear on how to achieve this. I'm sure this is a trivial problem and if anyone can point me in the right direction I would greatly appreciate it!
import sqlite3
dbName = 'test.db'
################# Create the Database File ###################
# Connecting to the database file
conn = sqlite3.connect(dbName)
c = conn.cursor()
#Create the identification table with names and origin
c.execute('''CREATE TABLE apples(appleID INT PRIMARY KEY, Name TEXT,
farmGrown TEXT)''')
#Create the table with key data
c.execute('''CREATE TABLE keyFeature(appleID INT PRIMARY KEY, taste INT,
texture INT, Colour TEXT)''')
#Populate apples table and id in keyFeature table
AppleName = ['Granny Smith', 'Golden Delicious', 'Pink Lady', 'Russet']
appleFarmGrown = ['Farm 1', 'Farm 2', 'Farm 3', 'Farm 4']
id = []
for i in range(len(AppleName)):
    id.append(i+1)
    c = conn.cursor()
    c.execute('''INSERT INTO apples(appleID, Name, farmGrown)
                 VALUES(?,?,?)''', (id[i], AppleName[i], appleFarmGrown[i]))
    c.execute('''INSERT INTO keyFeature(appleID)
                 VALUES(?)''', (id[i],))
#Current Apple to populate row in keyFeature
appleCurrent = (('Pink Lady','Farm 3'))
tasteCurrent = 4
textureCurrent = 5
colourCurrent = 'red'
#Find ID and write into the database
c.execute("SELECT appleID FROM apples")
appleIDDB = c.fetchall()
c.execute("SELECT name FROM apples")
nameDB = c.fetchall()
c.execute("SELECT farmGrown FROM apples")
farmGrownDB = c.fetchall()
conn.commit()
conn.close()
# I assume that if I close the connection the cursor whould be positioned at the
# first row again but this doesn't appear to be the case
conn = sqlite3.connect(dbName)
for i in range(len(appleIDDB)):
    c = conn.cursor()
    if ((nameDB[i][0] == appleCurrent[0]) and (farmGrownDB[i][0] == appleCurrent[1])):
        idCurrent = appleIDDB[i][0]
        print("This apple matches the apple stored with id number " + str(idCurrent))
        # This writes into the fifth row of the table
        c.execute('''INSERT INTO keyFeature(taste, texture, Colour)
                     VALUES(?,?,?)''', (tasteCurrent, textureCurrent, colourCurrent))
        conn.commit()
conn.close()
You're almost there.
A relational database such as sqlite isn't quite like a table in a spreadsheet. Instead of a list of rows in a certain order, you just have "a big bag of rows" (technically called a set of tuples), and you can sort them any way you want.
The way we solve your problem, as you've already identified when you created the table, is to make sure every row has a key that allows us to identify it (like your apple ID). When we want this key to represent an ID in another table, that's called a foreign key. So we just need to add a foreign key (called appleID) to the keyFeature table, and use that whenever we add a row to it.
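As an aside, you can declare that foreign key explicitly so SQLite enforces it; a hedged sketch of the table definition (optional for the fix below, and note the sqlite3 module only enforces foreign keys once the pragma is switched on):
conn.execute('PRAGMA foreign_keys = ON')
c.execute('''CREATE TABLE keyFeature(appleID INT PRIMARY KEY, taste INT,
             texture INT, Colour TEXT,
             FOREIGN KEY(appleID) REFERENCES apples(appleID))''')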
First, get rid of this from your first for loop; we don't need it at this stage, and the table can sit empty:
c.execute('''INSERT INTO keyFeature(appleID)
VALUES(?)''', (id[i],))
Next, you don't need to get the whole table to find the apple you want, you can just select the one you are interested in:
c.execute("SELECT appleID FROM apples WHERE name=? AND farm=?",("Pink Lady", "Farm 3"))
idCurrent = c.fetchone()[0]
The real trick is, when adding data to keyFeature, we have to insert all the data in one statement. This way a new tuple (row) is created with the ID and all the other information all at once. As if it were in "the right place" in the table.
c.execute('''INSERT INTO keyFeature(appleID, taste, texture, Colour)
VALUES(?,?,?,?)''', (idCurrent, tasteCurrent, textureCurrent, colourCurrent))
Now we can retrieve information from the keyFeature table using the ID of the apple we're interested in.
c.execute("SELECT taste, texture, Colour FROM keyFeature WHERE apple_id=?", (my_apple_id,))
Final complete code:
import sqlite3
dbName = 'test.db'
################# Create the Database File ###################
# Connecting to the database file
conn = sqlite3.connect(dbName)
c = conn.cursor()
#Create the identification table with names and origin
c.execute('''CREATE TABLE apples(appleID INT PRIMARY KEY, Name TEXT,
farmGrown TEXT)''')
#Create the table with key data
c.execute('''CREATE TABLE keyFeature(appleID INT PRIMARY KEY, taste INT,
texture INT, Colour TEXT)''')
#Populate apples table and id in keyFeature table
AppleName = ['Granny Smith', 'Golden Delicious', 'Pink Lady', 'Russet']
appleFarmGrown = ['Farm 1', 'Farm 2', 'Farm 3', 'Farm 4']
id = []
for i in range(len(AppleName)):
    id.append(i+1)
    c = conn.cursor()
    c.execute('''INSERT INTO apples(appleID, Name, farmGrown)
                 VALUES(?,?,?)''', (id[i], AppleName[i], appleFarmGrown[i]))
#Current Apple to populate row in keyFeature
appleCurrent = ('Pink Lady','Farm 3')
tasteCurrent = 4
textureCurrent = 5
colourCurrent = 'red'
#Find ID and write into the database
c.execute("SELECT appleID FROM apples WHERE name=? AND farm=?",(appleCurrent[0], appleCurrent[1]))
idCurrent = c.fetchone()[0]
c.execute('''INSERT INTO keyFeature(appleID, taste, texture, Colour)
VALUES(?,?,?,?)''', (idCurrent, tasteCurrent, textureCurrent, colourCurrent))
conn.commit()
conn.close()
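To check that the row landed under the right ID, a quick verification sketch (Pink Lady was inserted third, so its appleID is 3):
conn = sqlite3.connect(dbName)
c = conn.cursor()
c.execute('SELECT * FROM keyFeature WHERE appleID=?', (3,))
print(c.fetchone())  # expected: (3, 4, 5, 'red')
conn.close()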
Is there a way to produce this output using SQL for all tables in a given database (using MySQL) without having to specify individual table names and columns?
Table Column Count
---- ---- ----
Table1 Col1 0
Table1 Col2 100
Table1 Col3 0
Table1 Col4 67
Table1 Col5 0
Table2 Col1 30
Table2 Col2 0
Table2 Col3 2
... ... ...
The purpose is to identify columns for analysis based on how much data they contain (a significant number of columns are empty).
The 'workaround' solution using python (one table at a time):
# Libraries
import pymysql
import pandas as pd
import pymysql.cursors
# Connect to mariaDB
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='my_password',
                             db='my_database',
                             charset='latin1',
                             cursorclass=pymysql.cursors.DictCursor)
# Get column metadata
sql = """SELECT *
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='my_database'
"""
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
# Store in dataframe
df = pd.DataFrame(result)
df = df[['TABLE_NAME', 'COLUMN_NAME']]
# Build SQL string (one table at a time for now)
my_table = 'my_table'
df_my_table = df[df.TABLE_NAME==my_table].copy()
cols = list(df_my_table.COLUMN_NAME)
col_strings = [''.join(['COUNT(', x, ') AS ', x, ', ']) for x in cols]
col_strings[-1] = col_strings[-1].replace(',','')
sql = ''.join(['SELECT '] + col_strings + ['FROM ', my_table])
# Execute
with connection.cursor() as cursor:
    cursor.execute(sql)
    result = cursor.fetchall()
The result is a dictionary of column names and counts.
Basically, no. See also this answer.
Also, note that the closest match in the answer linked above is essentially the method you're already using, just implemented less efficiently in reflective SQL.
I'd do the same as you did - build a SQL like
SELECT
COUNT(*) AS `count`,
SUM(IF(columnName1 IS NULL,1,0)) AS columnName1,
...
SUM(IF(columnNameN IS NULL,1,0)) AS columnNameN
FROM tableName;
using information_schema as a source for table and column names, then execute it for each table in MySQL, then disassemble the single row returned into N tuple entries (tableName, columnName, total, nulls).
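A hedged sketch of that approach in Python, reusing the asker's pymysql connection (DictCursor assumed, as in the question; 'my_database' is a placeholder):
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT TABLE_NAME, COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS "
        "WHERE TABLE_SCHEMA = %s ORDER BY TABLE_NAME, ORDINAL_POSITION",
        ('my_database',))
    columns = cursor.fetchall()

# Group column names by table.
by_table = {}
for col in columns:
    by_table.setdefault(col['TABLE_NAME'], []).append(col['COLUMN_NAME'])

# One query per table, then flatten the single result row into
# (tableName, columnName, total, nulls) tuples.
results = []
for table, names in by_table.items():
    select_list = ', '.join(
        'SUM(IF(`%s` IS NULL,1,0)) AS `%s`' % (c, c) for c in names)
    with connection.cursor() as cursor:
        cursor.execute('SELECT COUNT(*) AS `__total`, %s FROM `%s`'
                       % (select_list, table))
        counts = cursor.fetchone()
    total = counts.pop('__total')
    results.extend((table, col, total, nulls) for col, nulls in counts.items())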
It is possible, but it's not going to be quick.
As mentioned in a previous answer you can work your way through the columns table in the information_schema to build queries to get the counts. It's then just a question of how long you are prepared to wait for the answer because you end up counting every row, for every column, in every table. You can speed things up a bit if you exclude columns that are defined as NOT NULL in the cursor (i.e. IS_NULLABLE = 'YES').
The solution suggested by LSerni is going to be much faster, particularly if you have very wide tables and/or high row counts, but would require more work handling the results.
e.g.
DELIMITER //

DROP PROCEDURE IF EXISTS non_nulls //

CREATE PROCEDURE non_nulls (IN sname VARCHAR(64))
BEGIN
    -- Parameters:
    -- Schema name to check
    -- call non_nulls('sakila');
    DECLARE vTABLE_NAME varchar(64);
    DECLARE vCOLUMN_NAME varchar(64);
    DECLARE vIS_NULLABLE varchar(3);
    DECLARE vCOLUMN_KEY varchar(3);
    DECLARE done BOOLEAN DEFAULT FALSE;
    DECLARE cur1 CURSOR FOR
        SELECT `TABLE_NAME`, `COLUMN_NAME`, `IS_NULLABLE`, `COLUMN_KEY`
        FROM `information_schema`.`columns`
        WHERE `TABLE_SCHEMA` = sname
        ORDER BY `TABLE_NAME` ASC, `ORDINAL_POSITION` ASC;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

    DROP TEMPORARY TABLE IF EXISTS non_nulls;
    CREATE TEMPORARY TABLE non_nulls(
        table_name VARCHAR(64),
        column_name VARCHAR(64),
        column_key CHAR(3),
        is_nullable CHAR(3),
        `rows` BIGINT,
        populated BIGINT
    );

    OPEN cur1;
    read_loop: LOOP
        FETCH cur1 INTO vTABLE_NAME, vCOLUMN_NAME, vIS_NULLABLE, vCOLUMN_KEY;
        IF done THEN
            LEAVE read_loop;
        END IF;
        SET @sql := CONCAT('INSERT INTO non_nulls ',
            '(table_name,column_name,column_key,is_nullable,`rows`,populated) ',
            'SELECT \'', vTABLE_NAME, '\',\'', vCOLUMN_NAME, '\',\'', vCOLUMN_KEY, '\',\'',
            vIS_NULLABLE, '\', COUNT(*), COUNT(`', vCOLUMN_NAME, '`) ',
            'FROM `', sname, '`.`', vTABLE_NAME, '`');
        PREPARE stmt1 FROM @sql;
        EXECUTE stmt1;
        DEALLOCATE PREPARE stmt1;
    END LOOP;
    CLOSE cur1;

    SELECT * FROM non_nulls;
END //

DELIMITER ;
call non_nulls('sakila');
I'm trying to accomplish a very simple task:
Create a table in SQLite
Insert several rows
Query a single column in the table and pull back each row
Code to create table:
import sqlite3
sqlite_file = '/Users/User/Desktop/DB.sqlite'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
c.execute('''CREATE TABLE ListIDTable(ID numeric, Day numeric, Month
numeric, MonthTxt text, Year numeric, ListID text, Quantity text)''')
values_to_insert = [
(1,16,7,"Jul",2015,"XXXXXXX1","Q2"),
(2,16,7,"Jul",2015,"XXXXXXX2","Q2"),
(3,14,7,"Jul",2015,"XXXXXXX3","Q1"),
(4,14,7,"Jul",2015,"XXXXXXX4","Q1")] #Entries continue similarly
c.executemany("INSERT INTO ListIdTable (ID, Day, Month, MonthTxt,
Year, ListID, Quantity) values (?,?,?,?,?,?,?)", values_to_insert)
conn.commit()
conn.close()
When I look at this table in SQLite DB Browser, everything looks fine.
Here's my code to try and query the above table:
import sqlite3
sqlite_file = '/Users/User/Desktop/DB.sqlite'
conn = sqlite3.connect(sqlite_file)
conn.row_factory = sqlite3.Row
c = conn.cursor()
for row in c.execute('select * from ListIDTable'):
    r = c.fetchone()
    ID = r['ID']
    print (ID)
I should get a print out of 1, 2, 3, 4.
However, I only get 2 and 4.
My code actually uploads 100 entries to the table, but still, when I query, I only get ID printouts of even numbers (i.e. 2, 4, 6, 8 etc.).
Thanks for any advice on fixing this.
You don't need to fetchone in the loop -- The loop is already fetching the values (one at a time). If you fetchone while you're iterating, you'll only see half the data because the loop fetches one and then you immediately fetch the next one (without ever looking at the one that was fetched by the loop):
for r in c.execute('select * from ListIDTable'):
    ID = r['ID']
    print (ID)
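If you prefer explicit fetch calls instead, drive the loop with fetchone alone; a minimal equivalent sketch (same connection and row factory as above):
c.execute('select * from ListIDTable')
while True:
    r = c.fetchone()
    if r is None:
        break
    print(r['ID'])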
Normally, if I want to insert values into a table, I will do something like this (assuming that I know which columns the values I want to insert belong to):
conn = sqlite3.connect('mydatabase.db')
conn.execute("INSERT INTO MYTABLE (ID,COLUMN1,COLUMN2)\
VALUES(?,?,?)",[myid,value1,value2])
But now I have a list of columns (the length of the list may vary) and a list of values for each column in the list.
For example, if I have a table with 10 columns (namely column1, column2, ..., column10). I have a list of columns that I want to update, say [column3, column4], and a list of values for those columns, [value for column3, value for column4].
How do I insert the values in the list into the individual columns they each belong to?
As far as I know the parameter list in conn.execute works only for values, so we have to use string formatting like this:
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a integer, b integer, c integer)')
col_names = ['a', 'b', 'c']
values = [0, 1, 2]
conn.execute('INSERT INTO t (%s, %s, %s) values(?,?,?)'%tuple(col_names), values)
Please note this is risky as written: strings interpolated into SQL should always be checked for injection attacks. You could, however, pass the list of column names through a validation function before insertion (see the sketch below).
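For example, a hedged sketch of such a validation step (a whitelist regex for plain identifiers; the values themselves still go through the ? parameters):
import re

def check_identifier(name):
    # Reject anything that is not a simple identifier, blocking injection
    # through column names; tighten the pattern to match your schema.
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', name):
        raise ValueError('unsafe column name: %r' % name)
    return name

col_names = [check_identifier(c) for c in col_names]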
EDITED:
For variables with various length you could try something like
exec_text = 'INSERT INTO t (' + ','.join(col_names) + ') values(' + ','.join(['?'] * len(values)) + ')'
conn.execute(exec_text, values)
# as long as len(col_names) == len(values)
Of course string formatting will work, you just need to be a bit cleverer about it.
col_names = ','.join(col_list)
col_spaces = ','.join(['?'] * len(col_list))
sql = 'INSERT INTO t (%s) values(%s)' % (col_names, col_spaces)
conn.execute(sql, values)
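As a self-contained illustration of that approach (table and column names here are arbitrary):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (column1, column2, column3, column4)')

col_list = ['column3', 'column4']
values = [30, 40]

col_names = ','.join(col_list)
col_spaces = ','.join(['?'] * len(col_list))
sql = 'INSERT INTO t (%s) values(%s)' % (col_names, col_spaces)
conn.execute(sql, values)
print(conn.execute('SELECT * FROM t').fetchall())  # [(None, None, 30, 40)]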
I was looking for a solution to create columns based on a list of unknown / variable length and found this question. However, I managed to find a nicer solution (for me anyway), that's also a bit more modern, so thought I'd include it in case it helps someone:
import sqlite3
def create_sql_db(my_list):
    file = 'my_sql.db'
    table_name = 'table_1'
    init_col = 'id'
    col_type = 'TEXT'

    conn = sqlite3.connect(file)
    c = conn.cursor()

    # CREATE TABLE (IF IT DOESN'T ALREADY EXIST)
    c.execute('CREATE TABLE IF NOT EXISTS {tn} ({nf} {ft})'.format(
        tn=table_name, nf=init_col, ft=col_type))

    # CREATE A COLUMN FOR EACH ITEM IN THE LIST
    for new_column in my_list:
        c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
            tn=table_name, cn=new_column, ct=col_type))

    conn.close()

my_list = ["Col1", "Col2", "Col3"]
create_sql_db(my_list)
All my data is of the type text, so I just have a single variable "col_type" - but you could for example feed in a list of tuples (or a tuple of tuples, if that's what you're into):
my_other_list = [("ColA", "TEXT"), ("ColB", "INTEGER"), ("ColC", "BLOB")]
and change the CREATE A COLUMN step to:
for tupl in my_other_list:
    new_column = tupl[0]  # "ColA", "ColB", "ColC"
    col_type = tupl[1]    # "TEXT", "INTEGER", "BLOB"
    c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
        tn=table_name, cn=new_column, ct=col_type))
As a noob, I can't comment on the very succinct, updated solution @ron_g offered. While testing, though, I had to frequently delete the sample database itself, so for any other noobs using this to test, I would advise adding:
c.execute('DROP TABLE IF EXISTS {tn}'.format(
    tn=table_name))
prior to the 'CREATE TABLE ...' portion.
It appears there are multiple instances of .format(tn=table_name ...) in both 'CREATE TABLE ...' and 'ALTER TABLE ...', so I'm trying to figure out whether it's possible to create a single instance (similar to, or included in, the def section).