How can I query in Python for both a condition where a column equals a value (i.e. r.user = a given user id) and a condition where a column is NOT IN a given list of movie ids?
This is what I currently have:
placeholder = '?' # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join([placeholder] * len(l))
query = 'SELECT r.user, r.movie, r.rating, m.title FROM ratings r JOIN movies m ON (r.movie = m.id) ' \
'WHERE r.user = 405 AND r.rating >= 3 AND r.movie NOT IN (%s)' % placeholders
cursor.execute(query, ('405', l))
movies_table = cursor.fetchall()
Here l refers to a list of movie ids, so that I can get the result set where the movie id is not in that list.
Thanks very much.
I'm currently able to filter on one condition or the other, but not both, seemingly because of the number of parameters being passed.
You need to call cursor.execute() with one parameter per placeholder.
Try something like this:
cursor.execute(query, tuple(l))
If you want to append the 405 to the list of values, then you can do something like:
cursor.execute(query, (405, *l))
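Note that in the query above the 405 is hard-coded, so to pass it as a parameter the WHERE clause also needs a ? for the user id. A minimal sketch of how it could all fit together, assuming SQLite's ? paramstyle and that l is the list of movie ids to exclude:
l = [11, 22, 33]  # hypothetical list of movie ids to exclude
placeholders = ', '.join(['?'] * len(l))
query = ('SELECT r.user, r.movie, r.rating, m.title '
         'FROM ratings r JOIN movies m ON (r.movie = m.id) '
         'WHERE r.user = ? AND r.rating >= 3 AND r.movie NOT IN (%s)' % placeholders)
cursor.execute(query, (405, *l))  # one parameter per ? placeholder
movies_table = cursor.fetchall()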
I have this little bit of code:
cquery = "SELECT * FROM `workers` WHERE `Username` = (%s)"
cvalue = (usernameR,)
flash(cquery)
flash(cvalue)
x = c1.execute(cquery, cvalue)
flash(x)
usernameR is a string variable; I got its value from a form.
x is supposed to be the number of rows or some such value, but it returns None, and I need its value for an if statement.
I tested it with a value that is in the table in exactly one row, so it's not the case that the value isn't there. But even if it weren't there, x should return 0 or something.
I can't work out what the problem is after several hours.
value of cvalue:
('Csabatron99',)
Edit for solution:
I needed to add rowcount and fetchall() to the code, like this:
cquery = "SELECT * FROM `workers` WHERE `Username` = (%s)"
cvalue = (usernameR,)
flash(cquery)
flash(cvalue)
c1.execute(cquery, cvalue)
c1.fetchall()
a = c1.rowcount
cursor.execute() doesn't return anything in the normal case. If you use the multi=True argument, it returns an iterator used to get results from each of the multiple queries.
To get the number of rows returned by the query, use the rowcount attribute.
c1.execute(cquery, cvalue)
flash(c1.rowcount)
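If all you need is the row count for an if check, a minimal sketch of the whole sequence (the flash messages here are just illustrative):
c1.execute(cquery, cvalue)
rows = c1.fetchall()     # fetch the result set first, then the count is reliable
if len(rows) > 0:        # or: if c1.rowcount > 0:
    flash("User found")
else:
    flash("No such user")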
I'm using psycopg2 to interact with a PostgreSQL database. I have a function whereby any number of columns (from a single column to all columns) in a table could be inserted into. My question is: how would one properly, dynamically, construct this query?
At the moment I am using string formatting and concatenation, and I know this is the absolute worst way to do this. Consider the code below where, in this case, the unknown number of columns (i.e. the number of keys in the dict) happens to be 2:
dictOfUnknownLength = {'key1': 3, 'key2': 'myString'}
def createMyQuery(user_ids, dictOfUnknownLength):
fields, values = list(), list()
for key, val in dictOfUnknownLength.items():
fields.append(key)
values.append(val)
fields = str(fields).replace('[', '(').replace(']', ')').replace("'", "")
values = str(values).replace('[', '(').replace(']', ')')
query = f"INSERT INTO myTable {fields} VALUES {values} RETURNING someValue;"
The resulting query looks like this:
INSERT INTO myTable (key1, key2) VALUES (3, 'myString') RETURNING someValue;
This provides a correctly formatted query but is of course prone to SQL injections and the like and, as such, is not an acceptable method of achieving my goal.
In other queries I am using the recommended methods of query construction when handling a known number of variables (%s placeholders and a separate argument to .execute() containing the variables), but I'm unsure how to adapt this to accommodate an unknown number of variables without using string formatting.
How can I elegantly and safely construct a query with an unknown number of specified insert columns?
To add to your worries, the current methodology using .replace() is prone to edge cases where fields or values contain [, ], or '. They will get replaced no matter what and may mess up your query.
You could always use .join() to join a variable number of values in your list. On top of that, format the query appropriately with %s placeholders after VALUES and pass your arguments into .execute().
Note: You may also want to consider the case where the number of fields is not equal to the number of values.
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

dictOfUnknownLength = {'key1': 3, 'key2': 'myString'}

def createMyQuery(user_ids, dictOfUnknownLength):
    # Directly assign keys/values.
    fields, values = list(dictOfUnknownLength.keys()), list(dictOfUnknownLength.values())

    if len(fields) != len(values):
        # Raise an error? SQL won't work in this case anyway...
        pass

    # Stringify the fields and build one %s placeholder per value.
    fieldsParam = ', '.join(fields)                  # "key1, key2"
    valuesParam = ', '.join(['%s'] * len(values))    # "%s, %s"

    # "INSERT ... (key1, key2) VALUES (%s, %s) ..."
    query = 'INSERT INTO myTable ({}) VALUES ({}) RETURNING someValue;'.format(fieldsParam, valuesParam)

    # .execute('INSERT ... (key1, key2) VALUES (%s, %s) ...', [3, 'myString'])
    # Anti-SQL-injection: pass the placeholder values as the second argument.
    cur.execute(query, values)
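Note that the table and field names are still interpolated with str.format here, so they have to be trusted; the parameterization only protects the values. If the keys could come from user input, psycopg2's sql module can compose identifiers safely as well. A rough sketch of that alternative (not part of the original answer; cur and dictOfUnknownLength as above):
from psycopg2 import sql

def createMyQuerySafe(dictOfUnknownLength):
    fields = list(dictOfUnknownLength.keys())
    values = list(dictOfUnknownLength.values())

    # Identifiers are quoted by psycopg2 itself; values stay as %s placeholders.
    query = sql.SQL("INSERT INTO {table} ({fields}) VALUES ({placeholders}) RETURNING someValue").format(
        table=sql.Identifier('myTable'),
        fields=sql.SQL(', ').join(map(sql.Identifier, fields)),
        placeholders=sql.SQL(', ').join(sql.Placeholder() * len(values)),
    )
    cur.execute(query, values)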
I am using pyodbc to update an Access database.
I need the functionality of an UPSERT.
ON DUPLICATE KEY UPDATE doesn't exist in Access SQL, and REPLACE is not an option since I want to keep other fields.
There are a lot of suggestions out there on how to solve this, so this is the solution I put together:
for table_name in data_source:
table = data_source[table_name]
for y in table:
if table_name == "whatever":
SQL_UPDATE = "UPDATE {} set [Part Name] = '{}', [value] = {}, [code] = {}, [tolerance] = {} WHERE [Unique Part Number]='{}'".\
format(table_name,y['PartName'],y['Value'],y['keycode'],y['Tolerance'], y['PartNum'])
SQL_INSERT = "INSERT INTO {} ([Part Name],[Unique Part Number], [value], [code], [tolerance]) VALUES ('{}','{}','{}',{},{},{});".\
format(table_name,y['PartName'],y['PartNum'],y['Value'],y['keycode'],y['Tolerance'])
elif ...:
    # ... the same pattern for 9 more tables ...
res = cursor.execute(SQL_UPDATE)
if res.rowcount == 0:
cursor.execute(SQL_INSERT)
Well, I have to say, I am not a Python expert and I haven't fully grasped the fundamental concepts nor the magic of SQL, so I can mostly just Google things together here.
I don't like my solution above because it is very hard to read and difficult to maintain (I have to do this for ~10 different tables). The other point is that I have to use two queries, because I didn't manage to understand and run any other UPSERT approach I found.
Does anyone have a recommendation for me how to do this in a smarter, better maintainable way?
As noted in this question and others, Access SQL does not have an "upsert" statement, so you will need to use a combination of UPDATE and INSERT. However, you can improve your current implementation by
using proper parameters for your query, and
using Python string manipulation to build the SQL command text.
For example, to upsert into a table named [Donor]
Donor ID Last Name First Name
-------- --------- ----------
1 Thompson Gord
You can start with a list of the field names. The trick here is to put the key field(s) at the end, so the INSERT and UPDATE statements will refer to the fields in the same order (i.e., the UPDATE will refer to the ID field last because it will be in the WHERE clause).
data_fields = ['Last Name', 'First Name']
key_fields = ['Donor ID']
The parameter values will be the same for both the UPDATE and INSERT cases, e.g.
params = ('Elk', 'Anne', 2)
The UPDATE statement can be constructed like this
update_set = ','.join(['[' + x + ']=?' for x in data_fields])
update_where = ' AND '.join(['[' + x + ']=?' for x in key_fields])
sql_update = "UPDATE [Donor] SET " + update_set + " WHERE " + update_where
print(sql_update)
which shows us
UPDATE [Donor] SET [Last Name]=?,[First Name]=? WHERE [Donor ID]=?
Similarly, the INSERT statement can be constructed like this
insert_fields = ','.join(['[' + x + ']' for x in (data_fields + key_fields)])
insert_placeholders = ','.join(['?' for x in (data_fields + key_fields)])
sql_insert = "INSERT INTO [Donor] (" + insert_fields + ") VALUES (" + insert_placeholders + ")"
print(sql_insert)
which prints
INSERT INTO [Donor] ([Last Name],[First Name],[Donor ID]) VALUES (?,?,?)
So, to perform our upsert, all we need to do is
crsr.execute(sql_update, params)
if crsr.rowcount > 0:
print('Existing row updated.')
else:
crsr.execute(sql_insert, params)
print('New row inserted.')
crsr.commit()
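Since the same pattern repeats across roughly 10 tables, the field lists themselves can be treated as data. A rough sketch of how the approach above could be wrapped in a reusable helper (the function name and the example call are just illustrative):
def upsert(crsr, table, data_fields, key_fields, params):
    # Build the UPDATE and INSERT exactly as above, with the key fields last
    # so both statements consume the parameters in the same order.
    update_set = ','.join('[' + f + ']=?' for f in data_fields)
    update_where = ' AND '.join('[' + f + ']=?' for f in key_fields)
    sql_update = "UPDATE [" + table + "] SET " + update_set + " WHERE " + update_where

    all_fields = data_fields + key_fields
    sql_insert = ("INSERT INTO [" + table + "] ("
                  + ','.join('[' + f + ']' for f in all_fields)
                  + ") VALUES (" + ','.join('?' for _ in all_fields) + ")")

    crsr.execute(sql_update, params)
    if crsr.rowcount == 0:
        crsr.execute(sql_insert, params)

# params must be ordered data_fields first, then key_fields:
upsert(crsr, 'Donor', ['Last Name', 'First Name'], ['Donor ID'], ('Elk', 'Anne', 2))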
Consider using parameterized queries (prepared statements) with ? placeholders. str.format is still needed for identifiers such as table and field names. Then unpack the dictionary items with zip(*dict.items()) and pass the values as parameters in the cursor's execute call: cursor.execute(query, params).
for table_name in data_source:
    table = data_source[table_name]
    for y in table:
        keys, values = zip(*y.items())    # UNPACK DICTIONARY INTO TWO TUPLES
        if table_name == "whatever":
            # NOTE: the order of the dict items must match the placeholder order,
            # and the key field comes last so both statements share one tuple.
            SQL_UPDATE = ("UPDATE {} SET [Part Name] = ?, [value] = ?, [code] = ?,"
                          " [tolerance] = ? WHERE [Unique Part Number] = ?").format(table_name)
            SQL_INSERT = ("INSERT INTO {} ([Part Name], [value], [code],"
                          " [tolerance], [Unique Part Number]) VALUES (?, ?, ?, ?, ?);").format(table_name)
        res = cursor.execute(SQL_UPDATE, values)
        if res.rowcount == 0:
            cursor.execute(SQL_INSERT, values)
...
I have the following query
cursor.execute(
"""
SELECT transform(row_to_json(t)) FROM
(select * from table
where a = %s
and b = %s limit 1000) t;
"""
, (a_value, b_value))
Running records = cursor.fetchall() returns a list of one-element tuples.
Is there any way to return just a list of lists?
I am asking because I'd like to transform the list of lists into a numpy matrix, and looping through to turn the singleton tuples into lists is slow.
When you have more than one row you can use the following code:
result = [r[0] for r in cur.fetchall()]
As a quick fix you can return an array:
cursor.execute("""
select array_agg(transform(row_to_json(t)))
from (
select * from table
where a = %s and b = %s
limit 1000
) t;
""", (a_value, b_value))
As Psycopg adapts PostgreSQL arrays to Python lists, you can then just get that list:
records = cursor.fetchall()[0][0]
I guess it is possible to subclass cursor to return lists instead of tuples, but if you are not dealing with huge sets I think it is not worth the trouble.
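Since the stated goal is a NumPy matrix, the aggregated list can also be handed to NumPy directly; a minimal sketch, assuming numpy is installed and records is the list obtained above:
import numpy as np

matrix = np.array(records)  # records is the plain Python list pulled out of array_agg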
You may also use this code:
result = cur.fetchall()
x = map(list, list(result))
x = sum(x, [])
I have a loop that calculates the similarity between two documents. It collects all the tokens in a document and their scores and places them in a dictionary. It then compares the two dictionaries.
This is what I have so far; it works, but it is super slow:
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)
#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)
# loop through each token in doca and see if one matches in docb
for x in doca_dic:
if docb_dic.has_key(x):
#calculate the similarity by summing the products of the tf-idf_norm
similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity
I'm pretty new to Python, hence this mess. I need to speed it up, any help would be appreciated.
Thanks.
A Python point: adict.has_key(k) is obsolete in Python 2.X and vanished in Python 3.X. k in adict as an expression has been available since Python 2.2; use it instead. It will be faster (no method call).
An any-language practical point: iterate over the shorter dictionary.
Combined result:
if len(doca_dic) < len(docb_dic):
short_dict, long_dict = doca_dic, docb_dic
else:
short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
if x in long_dict:
#calculate the similarity by summing the products of the tf-idf_norm
similarity += short_dict[x] * long_dict[x]
And if you don't need the two dictionaries for anything else, you could create only the A one and iterate over the B (key, value) tuples as they pop out of your B query. After the docb = cursor2.fetchall(), replace all following code by this:
similarity = 0
for b_token, b_value in docb:
if b_token in doca_dic:
similarity += doca_dic[b_token] * b_value
Alternative to the above code: This is doing more work but it's doing more of the iterating in C instead of Python and may be faster.
similarity = sum(
doca_dic[k] * docb_dic[k]
for k in set(doca_dic) & set(docb_dic)
)
Final version of the Python code
# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
if len(doca) < len(docb):
short_doc, long_doc = doca, docb
else:
short_doc, long_doc = docb, doca
long_dict = dict(long_doc) # yes, it should be that simple
similarity = 0
for key, value in short_doc:
if key in long_dict:
similarity += long_dict[key] * value
Another practical point: you haven't said which part of it is slow ... working on the dicts or doing the selects? Put some calls of time.time() into your script.
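For example, a rough way to see where the time goes, wrapping the existing calls (purely illustrative):
import time

t0 = time.time()
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
doca = cursor1.fetchall()
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
docb = cursor2.fetchall()
t1 = time.time()
# ... dictionary building and the similarity computation go here ...
t2 = time.time()
print("queries: %.3f s, similarity: %.3f s" % (t1 - t0, t2 - t1))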
Consider pushing ALL the work onto the database. Following example uses a hardwired SQLite query but the principle is the same.
C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float,
primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B
where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>
And it's worth checking that the database table is appropriately indexed (e.g. one on token by itself) ... not having a usable index is a good way of making an SQL query run very slowly.
Explanation: Having an index on token may make either your existing queries or the "do all the work in the DB" query or both run faster, depending on the whims of the query optimiser in your DB software and the phase of the moon. If you don't have a usable index, the DB will read ALL the rows in your table -- not good.
Creating an index: create index atable_token_idx on atable(token);
Dropping an index: drop index atable_token_idx;
(but do consult the docs for your DB)
What about pushing some of the work onto the DB?
With a join you can have a result that is basically:
Token A.tfidf_norm B.tfidf_norm
-----------------------------------------
Apple 12.2 11.00
...
Word 29.87 33.21
Zealot 0.00 11.56
Zulu 78.56 0.00
And you just have to scan the cursor and do your operations.
If you don't need to know whether a word is present in one document and missing in the other, you don't need an outer join, and the list will be the intersection of the two sets. The example above automatically assigns a "0" for words missing from one of the two documents. See what your "matching" function requires.
One SQL query can do the job:
SELECT sum(index1.tfidf_norm*index2.tfidf_norm) FROM index index1, index index2 WHERE index1.token=index2.token AND index1.doc_id=? AND index2.doc_id=?
Just replace the '?' placeholders with the two document ids, respectively.
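For instance, running that single query from Python might look like this (a sketch that reuses the question's %s paramstyle and docid variables instead of '?'):
cursor1.execute(
    "SELECT SUM(index1.tfidf_norm * index2.tfidf_norm) "
    "FROM index index1, index index2 "
    "WHERE index1.token = index2.token AND index1.doc_id = %s AND index2.doc_id = %s",
    (docid[i][0], docid[j][0]))
similarity = cursor1.fetchone()[0]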