Weird behavior with SQLite insert or replace - python

I am trying to increment the count of a row in an SQLite database if the row exists, or add a new row if it doesn't, the way it is done in this SO post. However, I'm getting some weird behavior when I try to execute this SQL statement many times in quick succession. As an example, I tried running this code:
db = connect_to_db()
c = db.cursor()
for i in range(10):
    c.execute("INSERT OR REPLACE INTO subject_words (subject, word, count) VALUES ('subject', 'word', COALESCE((SELECT count + 1 FROM subject_words WHERE subject = 'subject' AND word = 'word'), 1));")
db.commit()
db.close()
And it inserted the following into the database
sqlite> select * from subject_words;
subject|word|1
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
subject|word|2
Those counts total 19 entries of the word 'word' with subject 'subject'. Can anyone explain this weird behavior?

I don't think you've understood what INSERT OR REPLACE actually does. The REPLACE clause would only come into play if it was not possible to do the insertion, because a unique constraint was violated. An example might be if your subject column was the primary key.
However, without any primary keys or other constraints, there's nothing being violated by inserting multiple rows with the same subject; so there's no reason to invoke the REPLACE clause.
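To illustrate, here is a minimal sketch (the schema is an assumption, since the question doesn't show one): with a UNIQUE constraint on (subject, word), the original statement behaves as intended.
import sqlite3

db = sqlite3.connect(':memory:')
c = db.cursor()
# Assumed schema: the UNIQUE constraint is what makes OR REPLACE fire
# on the second and later inserts.
c.execute("""CREATE TABLE subject_words (
                 subject TEXT,
                 word    TEXT,
                 count   INTEGER,
                 UNIQUE (subject, word))""")
for i in range(10):
    c.execute("""INSERT OR REPLACE INTO subject_words (subject, word, count)
                 VALUES ('subject', 'word',
                         COALESCE((SELECT count + 1 FROM subject_words
                                   WHERE subject = 'subject' AND word = 'word'), 1))""")
db.commit()
rows = c.execute("SELECT * FROM subject_words").fetchall()
print(rows)  # one row: ('subject', 'word', 10)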

That operation is much easier to write and understand if you do it with two SQL statements:
c.execute("""UPDATE subject_words SET count = count + 1
WHERE subject = ? AND WORD = ?""",
['subject', 'word'])
if c.rowcount == 0:
c.execute("INSERT INTO subject_words (subject, word, count) VALUES (?,?,?)",
['subject', 'word', 1])
This does not require a UNIQUE constraint on the columns you want to check, and is not any less efficient.
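If you do this for many (subject, word) pairs, the same two statements can be wrapped in a small helper; a sketch (the function name is made up):
def increment_word(c, subject, word):
    # bump the count if the pair already exists ...
    c.execute("""UPDATE subject_words SET count = count + 1
                 WHERE subject = ? AND word = ?""", (subject, word))
    # ... otherwise start it at 1
    if c.rowcount == 0:
        c.execute("INSERT INTO subject_words (subject, word, count) VALUES (?, ?, 1)",
                  (subject, word))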

Related

MySQL & Python select last corresponding row instead of first

I had a quick question. I want to count how many times a user has logged into the system. To achieve this I add 1 to the third element of the result. The problem is that every time the user logs in, the code fetches the first corresponding row, so login_num always ends up as 2, since the first corresponding row always contains a 1.
On Stack Overflow I searched for several solutions, which is how I came up with the DESC at the end of the fetch syntax. However, every time I tried this I ended up getting an error in return. Does anyone have an idea why this is the case?
Python code:
cursor.execute("Select rfid_uid, name, login_num FROM users rfid_uid="+str(id) + "ORDER BY id DESC")
result = cursor.fetchone()
if cursor.rowcount >= 1:
print("Welkom " + result[1])
print(result)
result = (result[0], result[1], result[2] + 1)
sql_insert = "INSERT INTO users (rfid_uid, name, login_num) VALUES (%s, %s, %s)"
cursor.execute(sql_insert, (result))
db.commit()
It seems your SQL statement refers to the table 'users', which presumably contains info about users in general (a row per user), not user logins. (As an aside, the query above fails with an error because it is missing the WHERE keyword before rfid_uid= and a space before ORDER BY.)
If you have each individual user login event registered in some table, I would let the database do the counting. Something like this:
SELECT COUNT(*) FROM user_logins WHERE rfid_uid='user_id';
You should get one row, which has your answer as an integer.
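In Python that might look like this sketch (user_logins is a hypothetical table with one row per login event):
cursor.execute("SELECT COUNT(*) FROM user_logins WHERE rfid_uid = %s", (id,))
(login_num,) = cursor.fetchone()  # single row, single integer column
print("Welkom, you have logged in %d times" % login_num)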

Can I use parameter insertion to specify column names for MySQL queries?

I want to know if it is possible to use parameter insertion for column names into MySQL queries using Python.
Consider the following two queries, both of which are passed to MySQLCursor.execute(). The first:
query = (
    'SELECT username, COUNT(*) '
    'FROM `entry` '
    'GROUP BY username;'
)
cursor.execute(query)
And the second:
query = (
    'SELECT %s, COUNT(*) '
    'FROM `entry` '
    'GROUP BY %s;'
)
data = ('username', 'username')
cursor.execute(query, data)
The first of these returns the results I expect (a count of how many times each distinct value appears in the username column); the second returns unexpected results, specifically [(u'username', n)], where n is the total number of rows in the table.
The problem with the second query is that the parameters are interpreted as strings. Is there a way to insert them such that they are interpreted as non-strings? I want to do this in a way that is safe from injection attacks.
The second syntax cannot work and is not recommended: parameter substitution binds values only, so the placeholder is quoted as the string literal 'username', and grouping by a constant lumps every row together. The first syntax is fine and safe, since no external input is interpolated into it.
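If the column name genuinely has to be dynamic, one common workaround is to validate it against a whitelist before interpolating it. A sketch (the whitelist and helper are made up for illustration):
ALLOWED_COLUMNS = {'username', 'entry_date'}  # hypothetical column names

def count_by_column(cursor, column):
    # Reject anything not explicitly whitelisted before interpolating.
    if column not in ALLOWED_COLUMNS:
        raise ValueError('unexpected column name: %r' % (column,))
    cursor.execute('SELECT `%s`, COUNT(*) FROM `entry` GROUP BY `%s`;' % (column, column))
    return cursor.fetchall()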

Insert list of dictionaries and variable into table

lst = [{'Fruit':'Apple','HadToday':2},{'Fruit':'Banana','HadToday':8}]
I have a long list of dictionaries of the form above.
I have two fixed variables.
person = 'Sam'
date = datetime.datetime.now()
I wish to insert this information into a MySQL table.
How I do it currently:
for item in lst:
    item['Person'] = person
    item['Date'] = date

cursor.executemany("""
    INSERT INTO myTable (Person,Date,Fruit,HadToday)
    VALUES (%(Person)s, %(Date)s, %(Fruit)s, %(HadToday)s)""", lst)
conn.commit()
Is there a way to do it that bypasses the loop, since the person and date variables are constant? I have tried:
lst = [{'Fruit':'Apple','HadToday':2},{'Fruit':'Banana','HadToday':8}]
cursor.executemany("""
    INSERT INTO myTable (Person,Date,Fruit,HadToday)
    VALUES (%s, %s, %(Fruit)s, %(HadToday)s)""", (person, date, lst))
conn.commit()
TypeError: not enough arguments for format string
Your problem here is that it tries to apply all of lst to %(Fruit)s, leaving nothing for %(HadToday)s.
You should not fix it by hardcoding the fixed values into the statement, because you get into trouble with a name like "Tim O'Molligan"; it's better to let the db handle the correct formatting.
Not MySQL, but you get the gist: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters - learned this myself just a week ago ;o)
Probably the cleanest way would be to set MySQL user variables (note the @ prefix; a leading # would start a comment in MySQL):
cursor.execute("SET @myname = %s", (person,))
cursor.execute("SET @mydate = %s", (datetime.datetime.now(),))
and use
cursor.executemany("""
    INSERT INTO myTable (Person,Date,Fruit,HadToday)
    VALUES (@myname, @mydate, %(Fruit)s, %(HadToday)s)""", lst)
I am not 100% sure about the syntax, but I hope you get the idea. Comment on or edit the answer if I have a mistake in it.
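For completeness, a plain-Python alternative sketch (my own, not from the answer above): merge the constants into each row with dict(), which avoids mutating the original dictionaries and still lets executemany() do all the work.
rows = [dict(item, Person=person, Date=date) for item in lst]
cursor.executemany("""
    INSERT INTO myTable (Person,Date,Fruit,HadToday)
    VALUES (%(Person)s, %(Date)s, %(Fruit)s, %(HadToday)s)""", rows)
conn.commit()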

Problems INSERTing record if similar doesn't already exist

I'm trying to check whether a record already exists in the database (by similar title), and insert it if not. I've tried it two ways and neither quite works.
More elegant way (?) using IF NOT EXISTS
if mode=="update":
#check if book is already present in the system
cursor.execute('IF NOT EXISTS (SELECT * FROM book WHERE TITLE LIKE "%s") INSERT INTO book (title,author,isbn) VALUES ("%s","%s","%s") END IF;' % (title,title,author,isbn))
cursor.execute('SELECT bookID FROM book WHERE TITLE LIKE "%s";' % (title))
bookID = cursor.fetchall()
print('found the bookid %s' % (bookID))
#cursor.execute('INSERT INTO choice (uid,catID,priority,bookID) VALUES ("%d","%s","%s","%s");' % ('1',cat,priority,bookID)) #commented out because above doesn't work
With this, I get an error on the IF NOT EXISTS query saying that "author" isn't defined (although it is).
Less elegant way using count of matching records
if mode=="update":
#check if book is already present in the system
cursor.execute('SELECT COUNT(*) FROM book WHERE title LIKE "%s";' % (title))
anyresults = cursor.fetchall()
print('anyresults looks like %s' % (anyresults))
if anyresults[0] == 0: # if we didn't find a bookID
print("I'm in the loop for adding a book")
cursor.execute('INSERT INTO book (title,author,isbn) VALUES ("%s","%s","%s");' % (title,author,isbn))
cursor.execute('SELECT bookID FROM book WHERE TITLE LIKE "%s";' % (title))
bookID = cursor.fetchall()
print('found the bookid %s' % (bookID))
#cursor.execute('INSERT INTO choice (uid,catID,priority,bookID) VALUES ("%d","%s","%s","%s");' % ('1',cat,priority,bookID)) #commented out because above doesn't work
In this version, anyresults is a tuple that looks like (0L,) but I can't find a way of matching it that gets me into that "loop for adding a book." if anyresults[0] == 0, 0L, '0', '0L' -- none of these seem to get me into the loop.
I think I may not be using IF NOT EXISTS correctly--examples I've found are for separate procedures, which aren't really in the scope of this small project.
ADDITION:
I think unutbu's code will work great, but I'm still getting this dumb NameError saying author is undefined, which prevents the INSERT from being tried, even when I am definitely passing it in.
if form.has_key("title"):
title = form['title'].value
mode = "update"
if form.has_key("author"):
author = form['author'].value
mode = "update"
print("I'm in here")
if form.has_key("isbn"):
isbn = form['isbn'].value
mode = "update"
It never prints that "I'm in here" test statement. What would stop it getting in there? It seems so obvious--I keep checking my indentation, and I'm testing it on the command line and definitely specifying all three parameters.
If you set up a UNIQUE index on book, then inserting unique rows is easy.
For example,
mysql> ALTER IGNORE TABLE book ADD UNIQUE INDEX book_index (title,author);
WARNING: if there are rows with non-unique (title,author) pairs, all but one such row will be dropped.
If you want just the author field to be unique, then just change (title,author) to (author).
Depending on how big the table, this may take a while...
Now, to insert a unique record,
sql = 'INSERT IGNORE INTO book (title,author,isbn) VALUES (%s, %s, %s)'
cursor.execute(sql, [title, author, isbn])
If (title,author) are unique, the triplet (title,author,isbn) is inserted into the book table.
If (title,author) are not unique, then the INSERT command is ignored.
Note the second argument to cursor.execute: passing the values separately like this helps prevent SQL injection.
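As a follow-up (my assumption, not part of the original answer): cursor.rowcount lets you detect whether the INSERT IGNORE actually inserted anything, since MySQL reports 0 affected rows when the statement is ignored.
sql = 'INSERT IGNORE INTO book (title,author,isbn) VALUES (%s, %s, %s)'
cursor.execute(sql, [title, author, isbn])
if cursor.rowcount == 0:
    # a row with this (title, author) already existed; nothing was inserted
    print('book already present')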
This doesn't answer your question since it's for Postgresql rather than MySQL, but I figured I'd drop it in for people searching their way here.
In Postgres, you can batch insert items if they don't exist:
CREATE TABLE book (title TEXT, author TEXT, isbn TEXT);

-- Create a row of test data:
INSERT INTO book (title, author, isbn) VALUES ('a', 'b', 'c');

-- Do the real batch insert:
INSERT INTO book
SELECT add.* FROM (VALUES
    ('a', 'b', 'c'),
    ('d', 'e', 'f'),
    ('g', 'h', 'i')
) AS add (title, author, isbn)
LEFT JOIN book ON (book.title = add.title)
WHERE book.title IS NULL;
This is pretty simple. It selects the new rows as if they're a table, then left joins them against the existing data. The rows that don't already exist will join against a NULL row; we then filter out the ones that already exist (where book.title isn't NULL). This is extremely fast: it takes only a single database transaction to do a large batch of inserts, and lets the database backend do a bulk join, which it's very good at.
By the way, you really need to stop formatting your SQL queries directly (unless you really have to and really know what you're doing, which you don't here). Use query substitution, e.g. cur.execute("SELECT * FROM table WHERE title=? and isbn=?", (title, isbn)).

Document Similarity: Comparing two documents efficiently

I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores and places them in a dictionary; it then compares the two dictionaries.
This is what I have so far, it works, but is super slow:
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# convert tuples to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)

# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
# convert tuples to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)

# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        # calculate the similarity by summing the products of the tf-idf_norm
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity
I'm pretty new to Python, hence this mess. I need to speed it up, any help would be appreciated.
Thanks.
A Python point: adict.has_key(k) is obsolete in Python 2.X and vanished in Python 3.X. k in adict as an expression has been available since Python 2.2; use it instead. It will be faster (no method call).
An any-language practical point: iterate over the shorter dictionary.
Combined result:
if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        # calculate the similarity by summing the products of the tf-idf_norm
        similarity += short_dict[x] * long_dict[x]
And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all following code by this:
similarity = 0
for b_token, b_value in docb:
    if b_token in doca_dic:
        similarity += doca_dic[b_token] * b_value
Alternative to the above code: This is doing more work but it's doing more of the iterating in C instead of Python and may be faster.
similarity = sum(
    doca_dic[k] * docb_dic[k]
    for k in set(doca_dic) & set(docb_dic)
)
Final version of the Python code
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()

if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca
long_dict = dict(long_doc)  # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value
Another practical point: you haven't said which part of it is slow ... working on the dicts or doing the selects? Put some calls to time.time() into your script.
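For instance, a minimal timing sketch (where exactly the phases split is an assumption about your script):
import time

t0 = time.time()
# ... run the two SELECTs and fetchall() calls here ...
t1 = time.time()
# ... build the dict(s) and sum the products here ...
t2 = time.time()
print "selects: %.3fs, dict work: %.3fs" % (t1 - t0, t2 - t1)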
Consider pushing ALL the work onto the database. The following example uses a hardwired SQLite query, but the principle is the same.
C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float,
primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B
where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>
And it's worth checking that the database table is appropriately indexed (e.g. one on token by itself) ... not having a usable index is a good way of making an SQL query run very slowly.
Explanation: Having an index on token may make either your existing queries or the "do all the work in the DB" query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don't have a usable index, the DB will read ALL the rows in your table -- not good.
Creating an index: create index atable_token_idx on atable(token);
Dropping an index: drop index atable_token_idx;
(but do consult the docs for your DB)
What about pushing some of the work onto the DB?
With a join you can get a result that is basically:
Token     A.tfidf_norm   B.tfidf_norm
-------------------------------------
Apple            12.20          11.00
...
Word             29.87          33.21
Zealot            0.00          11.56
Zulu             78.56           0.00
And you just have to scan the cursor and do your operations.
If you don't need to know whether a word is present in one document and missing in the other, you don't need an outer join, and the list will be the intersection of the two sets. The example above automatically assigns a "0" for words missing from one of the two documents. Check what your matching function requires.
One SQL query can do the job:
SELECT SUM(index1.tfidf_norm * index2.tfidf_norm)
FROM index index1, index index2
WHERE index1.token = index2.token
  AND index1.doc_id = ? AND index2.doc_id = ?
Just bind the two document ids to the '?' placeholders.
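A usage sketch (assuming a DB-API driver with the qmark paramstyle, e.g. sqlite3; with MySQLdb you would write %s instead of ?; the docid indexing mirrors the question's code):
cursor.execute(
    "SELECT SUM(index1.tfidf_norm * index2.tfidf_norm) "
    "FROM index index1, index index2 "
    "WHERE index1.token = index2.token "
    "AND index1.doc_id = ? AND index2.doc_id = ?",
    (docid[i][0], docid[j][0]))
(similarity,) = cursor.fetchone()  # the whole similarity comes back as one value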
