Document Similarity: Comparing two documents efficiently - python

I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores, and places them in a dictionary. It then compares the dictionaries.
This is what I have so far, it works, but is super slow:
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)
#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)
# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        # calculate the similarity by summing the products of the tf-idf_norm
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity
I'm pretty new to Python, hence this mess. I need to speed it up; any help would be appreciated.
Thanks.

A Python point: adict.has_key(k) is obsolete in Python 2.X and vanished in Python 3.X. k in adict as an expression has been available since Python 2.2; use it instead. It will be faster (no method call).
An any-language practical point: iterate over the shorter dictionary.
Combined result:
if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        # calculate the similarity by summing the products of the tf-idf_norm
        similarity += short_dict[x] * long_dict[x]
And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all following code by this:
similarity = 0
for b_token, b_value in docb:
    if b_token in doca_dic:
        similarity += doca_dic[b_token] * b_value
Alternative to the above code: this builds two sets, so it does somewhat more work overall, but more of the iterating happens in C instead of Python and it may be faster.
similarity = sum(
    doca_dic[k] * docb_dic[k]
    for k in set(doca_dic) & set(docb_dic)
)
Final version of the Python code
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca
long_dict = dict(long_doc)  # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value
Another practical point: you haven't said which part of it is slow ... working on the dicts or doing the selects? Put some calls to time.time() into your script to find out.
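For example, a rough split of the timing might look like this (a sketch only; docid, i, j and the cursors are assumed to exist exactly as in your code):
import time

t0 = time.time()
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
doca = cursor1.fetchall()
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
docb = cursor2.fetchall()
t1 = time.time()

doca_dic = dict(doca)
docb_dic = dict(docb)
similarity = sum(doca_dic[k] * docb_dic[k] for k in set(doca_dic) & set(docb_dic))
t2 = time.time()

print "selects: %.3f s   dict work: %.3f s" % (t1 - t0, t2 - t1)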
Consider pushing ALL the work onto the database. The following example uses a hardwired SQLite session, but the principle is the same.
C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float,
primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B
where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>
And it's worth checking that the database table is appropriately indexed (e.g. one on token by itself) ... not having a usable index is a good way of making an SQL query run very slowly.
Explanation: Having an index on token may make either your existing queries or the "do all the work in the DB" query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don't have a usable index, the DB will read ALL the rows in your table -- not good.
Creating an index: create index atable_token_idx on atable(token);
Dropping an index: drop index atable_token_idx;
(but do consult the docs for your DB)

What about pushing some of the work onto the DB?
With a join you can have a result that is basically
Token    A.tfidf_norm    B.tfidf_norm
-------------------------------------
Apple        12.2           11.00
...
Word        29.67           33.21
Zealot       0.00           11.56
Zulu        78.56            0.00
And you just have to scan the cursor and do your operations.
If you don't need to know whether a word is present in one document and missing in the other, you don't need an outer join, and the result will be the intersection of the two token sets. The example above automatically assigns a "0" for words missing from one of the two documents. See what your matching function requires.
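For the intersection case, a sketch of that scan in Python (reusing the question's index table and %s paramstyle; adjust for your driver):
cursor1.execute(
    "SELECT a.tfidf_norm, b.tfidf_norm"
    " FROM index a JOIN index b ON a.token = b.token"
    " WHERE a.doc_id = %s AND b.doc_id = %s",
    (docid[i][0], docid[j][0]))
similarity = 0.0
for a_value, b_value in cursor1.fetchall():
    similarity += a_value * b_value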

One SQL query can do the job:
SELECT sum(index1.tfidf_norm*index2.tfidf_norm) FROM index index1, index index2 WHERE index1.token=index2.token AND index1.doc_id=? AND index2.doc_id=?
Just pass the two document ids as the query parameters for the two '?' placeholders.
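For example, with a driver that uses ? placeholders (a sketch; with MySQLdb or psycopg2 use %s instead):
cursor.execute(
    "SELECT sum(index1.tfidf_norm * index2.tfidf_norm)"
    " FROM index index1, index index2"
    " WHERE index1.token = index2.token"
    " AND index1.doc_id = ? AND index2.doc_id = ?",
    (docid[i][0], docid[j][0]))
similarity = cursor.fetchone()[0] or 0.0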

Related

Properly format SQL query when insert into variable number of columns

I'm using psycopg2 to interact with a PostgreSQL database. I have a function whereby any number of columns (from a single column to all columns) in a table could be inserted into. My question is: how would one properly, dynamically, construct this query?
At the moment I am using string formatting and concatenation, and I know this is the absolute worst way to do this. Consider the code below where, in this case, my unknown number of columns (i.e. the number of keys in the dict) is in fact 2:
dictOfUnknownLength = {'key1': 3, 'key2': 'myString'}

def createMyQuery(user_ids, dictOfUnknownLength):
    fields, values = list(), list()
    for key, val in dictOfUnknownLength.items():
        fields.append(key)
        values.append(val)
    fields = str(fields).replace('[', '(').replace(']', ')').replace("'", "")
    values = str(values).replace('[', '(').replace(']', ')')
    query = f"INSERT INTO myTable {fields} VALUES {values} RETURNING someValue;"
    # query == INSERT INTO myTable (key1, key2) VALUES (3, 'myString') RETURNING someValue;
This provides a correctly formatted query but is of course prone to SQL injections and the like and, as such, is not an acceptable method of achieving my goal.
In other queries I am using the recommended methods of query construction when handling a known number of variables (%s and separate argument to .execute() containing variables) but I'm unsure how to adapt this to accommodate an unknown number of variables without using string formatting.
How can I elegantly and safely construct a query with an unknown number of specified insert columns?
To add to your worries, the current methodology using .replace() is prone to edge cases where fields or values contain [, ], or '. They will get replaced no matter what and may mess up your query.
You could always use .join() to join a variable number of values in your list. To top it off, build the VALUES clause from %s placeholders and pass your arguments into .execute().
Note: You may also want to consider the case where the number of fields is not equal to the number values.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

dictOfUnknownLength = {'key1': 3, 'key2': 'myString'}

def createMyQuery(user_ids, dictOfUnknownLength):
    # Directly assign keys/values.
    fields, values = list(dictOfUnknownLength.keys()), list(dictOfUnknownLength.values())
    if len(fields) != len(values):
        # Raise an error? SQL won't work in this case anyway...
        pass
    # Stringify the fields and build one %s placeholder per value.
    fieldsParam = ','.join(fields)                 # "key1,key2"
    valuesParam = ','.join(['%s'] * len(values))   # "%s,%s"
    # "INSERT ... (key1, key2) VALUES (%s, %s) ..."
    query = 'INSERT INTO myTable ({}) VALUES ({}) RETURNING someValue;'.format(fieldsParam, valuesParam)
    # .execute('INSERT ... (key1, key2) VALUES (%s, %s) ...', [3, 'myString'])
    cur.execute(query, values)  # Anti-SQL-injection: pass the values as the second argument.
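If the column names themselves could come from untrusted input, note that str.format() still splices them in as raw text. psycopg2 ships a psycopg2.sql module (since 2.7) for composing identifiers safely; here is a sketch of the same insert using it (create_my_query_safe is a made-up name, myTable and someValue are taken from the question):
from psycopg2 import sql

def create_my_query_safe(cur, data):
    # Quote each column name as an identifier and emit one placeholder per value.
    fields = list(data.keys())
    query = sql.SQL("INSERT INTO myTable ({}) VALUES ({}) RETURNING someValue").format(
        sql.SQL(', ').join(sql.Identifier(f) for f in fields),
        sql.SQL(', ').join(sql.Placeholder() for _ in fields))
    cur.execute(query, [data[f] for f in fields])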

Python, SQlite3 - Querying condition equals and not in (Multiple)

How can I query in Python both for a condition that equals a value, i.e. r.user = (given user id), and for a value that is NOT IN a given list of movie ids?
This is what I currently have
placeholder = '?' # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder * len(l))
query = 'SELECT r.user, r.movie, r.rating, m.title FROM ratings r JOIN movies m ON (r.movie = m.id) ' \
        'WHERE r.user = 405 AND r.rating >= 3 AND r.movie NOT IN (%s)' % placeholders
cursor.execute(query, ('405', l))
movies_table = cursor.fetchall()
l refers to a list of values, so I can get the result set where the movie id is not in that list.
Thanks very much,
I'm currently able to get one condition or the other to work, but not both, seemingly because of the number of parameters supplied.
You need to call cursor.execute() with one item per placeholder.
Try something like this:
cursor.execute(query, tuple(l))
If you also want to pass the 405 as a parameter (replace the literal 405 in the query with another ?), then you can do something like:
cursor.execute(query, (405, *l))
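Putting it together with the query from the question (a sketch; the user id is parameterized as well, so the number of placeholders matches the number of values):
placeholders = ', '.join('?' for _ in l)
query = ('SELECT r.user, r.movie, r.rating, m.title '
         'FROM ratings r JOIN movies m ON (r.movie = m.id) '
         'WHERE r.user = ? AND r.rating >= 3 AND r.movie NOT IN (%s)' % placeholders)
cursor.execute(query, (405, *l))
movies_table = cursor.fetchall()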

Weird behavior with SQLite insert or replace

I am trying to increment the count of a row in an SQLite database if the row exists, or add a new row if it doesn't exist, the way it is done in this SO post. However, I'm getting some weird behavior when I try to execute this SQL statement many times in quick succession. As an example I tried running this code:
db = connect_to_db()
c = db.cursor()
for i in range(10):
    c.execute("INSERT OR REPLACE INTO subject_words (subject, word, count) VALUES ('subject', 'word', COALESCE((SELECT count + 1 FROM subject_words WHERE subject = 'subject' AND word = 'word'), 1));")
db.commit()
db.close()
And it inserted the following into the database
sqlite> select * from subject_words;
subject|word|1
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
The counts sum to 19 entries of the word 'word' with subject 'subject'. Can anyone explain this weird behavior?
I don't think you've understood what INSERT OR REPLACE actually does. The REPLACE clause would only come into play if it was not possible to do the insertion, because a unique constraint was violated. An example might be if your subject column was the primary key.
However, without any primary keys or other constraints, there's nothing being violated by inserting multiple rows with the same subject; so there's no reason to invoke the REPLACE clause.
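For illustration only: if the table were created fresh with a uniqueness constraint on (subject, word), the original statement would behave as intended (a sketch, reusing the question's cursor c and connection db):
c.execute("""CREATE TABLE subject_words (
                 subject TEXT,
                 word TEXT,
                 count INTEGER,
                 UNIQUE (subject, word))""")
for i in range(10):
    c.execute("INSERT OR REPLACE INTO subject_words (subject, word, count) "
              "VALUES ('subject', 'word', "
              "COALESCE((SELECT count + 1 FROM subject_words "
              "WHERE subject = 'subject' AND word = 'word'), 1))")
db.commit()
# Each INSERT now replaces the previous row, leaving a single row with count = 10.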
That operation is much easier to write and understand if you do it with two SQL statements:
c.execute("""UPDATE subject_words SET count = count + 1
WHERE subject = ? AND WORD = ?""",
['subject', 'word'])
if c.rowcount == 0:
c.execute("INSERT INTO subject_words (subject, word, count) VALUES (?,?,?)",
['subject', 'word', 1])
This does not require a UNIQUE constraint on the columns you want to check, and is not any less efficient.

How to get the number of data rows from a sqlite table in python

I am trying to get the number of rows returned from an sqlite3 database in Python, but it seems the feature isn't available.
Think of PHP's mysqli_num_rows() in MySQL.
Although I devised a means, it is awkward: assume a class executes SQL and gives me the results:
# Query Execution returning a result
data = sql.sqlExec("select * from user")
# run another query for row-count checking, not a very good workaround
dataCopy = sql.sqlExec("select * from user")
# Try to cast dataCopy to a list and get its length. I did this because I noticed that as soon
# as I perform any action on the data, data becomes null.
# This is not too good, as someone else could perform another transaction on the database
# in the nick of time.
if len(list(dataCopy)):
    for m in data:
        print("Name = {}, Password = {}".format(m["username"], m["password"]))
else:
    print("Query return nothing")
Is there a function or property that can do this without such a workaround?
Normally, cursor.rowcount would give you the number of results of a query.
However, for SQLite, that property is often set to -1 due to the nature of how SQLite produces results. Short of a COUNT() query first you often won't know the number of results returned.
This is because SQLite produces rows as it finds them in the database, and won't itself know how many rows are produced until the end of the database is reached.
From the documentation of cursor.rowcount:
Although the Cursor class of the sqlite3 module implements this attribute, the database engine’s own support for the determination of “rows affected”/”rows selected” is quirky.
For executemany() statements, the number of modifications are summed up into rowcount.
As required by the Python DB API Spec, the rowcount attribute “is -1 in case no executeXX() has been performed on the cursor or the rowcount of the last operation is not determinable by the interface”. This includes SELECT statements because we cannot determine the number of rows a query produced until all rows were fetched.
Emphasis mine.
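To see that behaviour directly, here is a minimal sketch with an in-memory database:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user (username TEXT, password TEXT)")
cur.execute("INSERT INTO user VALUES ('alice', 'secret')")
cur.execute("SELECT * FROM user")
print(cur.rowcount)          # -1: unknown for a SELECT before fetching
print(len(cur.fetchall()))   # 1: known only after the rows are fetched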
For your specific query, you can add a sub-select to add a column:
data = sql.sqlExec("select (select count() from user) as count, * from user")
This is not all that efficient for large tables, however.
If all you need is one row, use cursor.fetchone() instead:
cursor.execute('SELECT * FROM user WHERE userid=?', (userid,))
row = cursor.fetchone()
if row is None:
    raise ValueError('No such user found')
result = "Name = {}, Password = {}".format(row["username"], row["password"])
import sqlite3
conn = sqlite3.connect("path/to/db")
cursor = conn.cursor()
cursor.execute("select * from user")
results = cursor.fetchall()
print(len(results))
len(results) is just what you want
Use the following:
dataCopy = sql.sqlExec("select count(*) from user")
values = dataCopy.fetchone()
print values[0]
When you just want an estimate beforehand, simply use COUNT():
n_estimate = cursor.execute("SELECT COUNT() FROM user").fetchone()[0]
To get the exact number before fetching, use a locked "Read transaction", during which the table won't be changed from outside, like this:
cursor.execute("BEGIN") # start transaction
n = cursor.execute("SELECT COUNT() FROM user").fetchone()[0]
# if n > big: be_prepared()
allrows=cursor.execute("SELECT * FROM user").fetchall()
cursor.connection.commit() # end transaction
assert n == len(allrows)
Note: A normal SELECT also takes a lock, but only until it is completely fetched, the cursor closes, or commit() / END or other actions implicitly end the transaction ...
I've found the SELECT statement with COUNT() to be slow on a very large DB. Moreover, using fetchall() can be very memory-intensive.
Unless you explicitly designed your database so that it does not have a rowid, you can always try a quick solution:
cur.execute("SELECT max(rowid) from Table")
n = cur.fetchone()[0]
This will tell you how many rows your table has, as long as rows have never been deleted (deleted rows can leave gaps, so max(rowid) may then overstate the count).
I did it like this:
cursor.execute("select count(*) from my_table")
results = cursor.fetchone()
print(results[0])
this code worked for me:
import sqlite3
con = sqlite3.connect(your_db_file)
cursor = con.cursor()
result = cursor.execute("select count(*) from your_table").fetchall()  # returns a list of tuples
num_of_rows = result[0][0]
A simple alternative approach here is to use fetchall to pull a column into a Python list, then count the length of the list. I don't know if this is Pythonic or especially efficient, but it seems to work:
rowlist = []
c.execute("SELECT {rowid} FROM {whichTable}".format(rowid="rowid", whichTable=whichTable))
rowlist = c.fetchall()
rowlistcount = len(rowlist)
print(rowlistcount)
The following script works:
def say():
    global s                                     # make s global
    vt = sqlite3.connect('kur_kel.db')           # connect to the db file
    bilgi = vt.cursor()
    bilgi.execute('select count(*) from kuke')   # execute the sql command
    say_01 = bilgi.fetchone()                    # fetch one row from the executed sql
    print(say_01[0])                             # print the first item of the tuple
    s = say_01[0]                                # assign the query result to a variable
    bilgi.close()                                # close the cursor
    vt.close()                                   # close the db file

Inserting multiple types into an SQLite database with Python

I'm trying to create an SQLite 3 database from Python. I have a few types I'd like to insert into each record: a float, and then 3 groups of n floats, currently tuples but they could be arrays or lists. I'm not well-enough versed in Python to understand all the differences. My problem is the INSERT statement.
DAS = 12345
lats = (42,43,44,45)
lons = (10,11,12,13)
times = (1,2,3,4,5,6,7,8,9)
import sqlite3
connection = sqlite3.connect("test.db")
cursor = connection.cursor()
cursor.execute( "create table foo(DAS LONG PRIMARY KEY,lats real(4),lons real(4), times real(9) )" )
I'm not sure what comes next. Something along the lines of:
cmd = 'INSERT into foo values (?,?,?,?), ...'
cursor.execute(cmd)
How should I best build the SQL insert command given this data?
The type real(4) does not mean an array/list/tuple of four reals; in standard SQL the 4 would be a size/precision modifier on the 'real' type. SQLite mostly ignores declared column types anyway (it uses manifest typing), though they can still affect column affinity.
You have a few options, such as storing the text representation (from repr) of each tuple, or using a separate column for each value.
You can use the various hooks provided by Python's SQLite library to handle some of the transformation for you, but separate columns (with helper functions to build the statements, so you don't repeat yourself) are probably the easiest to work with if you need to search or filter on each value in SQL.
If you do store a text representation, ast.literal_eval (or eval, under special conditions) will convert it back into a Python object.
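A minimal sketch of that text-representation approach, storing a tuple with repr() and reading it back with ast.literal_eval (the TEXT column types here are illustrative):
import ast
import sqlite3

DAS = 12345
lats = (42, 43, 44, 45)

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (DAS INTEGER PRIMARY KEY, lats TEXT, lons TEXT, times TEXT)")
cur.execute("INSERT INTO foo (DAS, lats) VALUES (?, ?)", (DAS, repr(lats)))
conn.commit()

stored = cur.execute("SELECT lats FROM foo WHERE DAS = ?", (DAS,)).fetchone()[0]
lats_back = ast.literal_eval(stored)   # (42, 43, 44, 45) again, as a tuple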
Something like this:
db = sqlite3.connect("test.db")
cursor = db.cursor()
cursor.execute("insert into foo values (?,?,?,?)", (val1, val2, val3, val4))
db.commit() # Autocommit is off by default (and rightfully so)!
Please note that I am not using string formatting to inject actual data into the query, but instead letting the library do this work for me. That way the data is quoted and escaped correctly.
EDIT: Obviously, considering your database schema, it doesn't work. It is impractical to attempt to store a collection-type value in a single field of a sqlite database. If I understand you correctly, you should just create a separate column for every value you are storing in the single row. That will be a lot of columns, sure, but that's the most natural way to do it.
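A sketch of what that separate-columns schema could look like for this data, assuming a fresh database (column names such as lat1 ... time9 are made up for illustration; db and cursor are as in the snippet above):
lat_cols  = ", ".join("lat%d REAL"  % i for i in range(1, 5))
lon_cols  = ", ".join("lon%d REAL"  % i for i in range(1, 5))
time_cols = ", ".join("time%d REAL" % i for i in range(1, 10))
cursor.execute("CREATE TABLE foo (DAS INTEGER PRIMARY KEY, %s, %s, %s)"
               % (lat_cols, lon_cols, time_cols))

row = (DAS,) + lats + lons + times        # 1 + 4 + 4 + 9 = 18 values
placeholders = ",".join("?" * len(row))   # "?,?,...,?" (18 placeholders)
cursor.execute("INSERT INTO foo VALUES (%s)" % placeholders, row)
db.commit()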
(A month later), two steps:
1. flatten e.g. DAS lats lons times to one long list, say 18 values long
2. generate "INSERT INTO tablename VALUES (?,?, ... 18 question marks)" and execute that.
Test = 1

def flatten( *args ):
    """ 1, (2,3), [4,5] -> [1, 2, 3, 4, 5] """
    # 1 level only -- SO [python] [flatten] zzz
    all = []
    for a in args:
        all.extend( a if hasattr( a, "__iter__" ) else [a] )
    if Test: print "test flatten:", all
    return all

def sqlinsert( db, tablename, *args ):
    flatargs = flatten( *args )  # one long list
    ncol = len(flatargs)
    qmarks = "?" + (ncol-1) * ",?"
    insert = "Insert into %s values (%s)" % (tablename, qmarks)
    if Test: print "test sqlinsert:", insert
    if db:
        db.execute( insert, flatargs )
        # db.executemany( insert, map( flatargs, rows ))
    return insert

#...............................................................................
if __name__ == "__main__":
    print sqlinsert( None, "Table", "hidiho", (4,5), [6] )
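Against a real connection, usage could look roughly like this (a sketch; the readings table and its three columns are made up for the example and must match the flattened value count):
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table readings (das integer, lat real, lon real)")
sqlinsert(db, "readings", 12345, (42.0,), [10.0])   # flattens to three values
db.commit()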
