I'm trying to extract email addresses from text in the column alltext and update the column email with the list of emails found in alltext. The datatype for email is a string array (i.e. text[]).
1) I'm getting the following error and can't seem to find a way around it:
psycopg2.ProgrammingError: syntax error at or near "["
LINE 1: UPDATE comments SET email=['person#email.com', 'other#email.com']
2) Is there a more efficient way to be doing this in the first place? I've experimented some with the PostgreSQL regex documentation but a lot of people seem to think it's not great for this purpose.
def getEmails():
'''Get emails from alltext.
'''
DB = psycopg2.connect("dbname=commentDB")
c = DB.cursor()
c.execute("SELECT id, alltext FROM comments WHERE id < 100")
for row in c:
match = re.findall(r'[\w\.-]+#[\w\.-]+', str(row[1]))
data = {'id':int(row[0]), 'email':match}
c.execute("UPDATE comments SET email=%(email)s WHERE id=%(id)s" % data)
DB.commit()
DB.close()
execute should be passed a list for unnamed arguments, or dict -- as in this case -- for named arguments, as a second argument to ensure that it is psycopg2 (via libpq) that is doing all the proper escaping. You are using native Python string interpolation, which is subject to SQL Injection, and leading to this error, since it isn't libpq doing the interpolation.
Also, as an aside, your regex won't capture various types of email addresses. One type that immediately comes to mind is the form foo+bar#loopback.edu. The + is technically allowed, and can be used, for example, for filtering email. See this link for more details as to issues that crop up with using regexes for validating/parsing email addresses.
In short, the above link recommends using this regex:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
With the caveat that it is valid for only what the author claims is valid email address. Still, it's probably a good jumping-off point, and can be adjusted if you have specific cases that differ.
Edit in response to comment from OP:
The execute line from above would become:
c.execute("UPDATE comments SET email=%(email)s WHERE id=%(id)s", data)
Note that data is now a second argument to execute as opposed to being part of an interpolation operation. This means that Psycopg2 will handle the interpolation and not only avoid the SQL Injection issue, but also properly interpret how the dict should be interpolated into the query string.
Edit in response to follow-up comment from OP:
Yes, the subsequent no results to fetch error is likely because you are using the same cursor. Since you are iterating over the current cursor, trying to use it again in the for loop to do an update interferes with the iteration.
I would declare a new cursor inside the for loop and use that.
Related
I'm to link my code to a MySQL database using pymysql. In general everything has gone smoothly but I'm having difficulty with the following function to find the minimum of a variable column.
def findmin(column):
cur = db.cursor()
sql = "SELECT MIN(%s) FROM table"
cur.execute(sql,column)
mintup = cur.fetchone()
if everything went smoothly this would return me a tuple with the minimum, e.g. (1,).
However, if I run the function:
findmin(column_name)
I have to put column name in "" (i.e. "column_name"), else Python sees it as an unknown variable. But if I put the quotation marks around column_name then SQL sees
SELECT MIN("column_name") FROM table
which just returns the column header, not the value.
How can I get around this?
The issue is likely the use of %s for the column name. That means the SQL Driver will try to escape that variable when interpolating it, including quoting, which is not what you want for things like column names, table names, etc.
When using a value in SELECT, WHERE, etc. then you do want to use %s to prevent SQL injections and enable quoting, among other things.
Here, you just want to interpolate using pure Python (assuming a trusted value; please see below for more information). That also means no bindings tuple passed to the execute method.
def findmin(column):
cur = db.cursor()
sql = "SELECT MIN({0}) FROM table".format(column)
cur.execute(sql)
mintup = cur.fetchone()
SQL fiddle showing the SQL working:
http://sqlfiddle.com/#!2/e70a41/1
In response to the Jul 15, 2014 comment from Colin Phipps (September 2022):
The relatively recent edit on this post by another community member brought it to my attention, and I wanted to respond to Colin's comment from many years ago.
I totally agree re: being careful about one's input if one interpolates like this. Certainly one needs to know exactly what is being interpolated. In this case, I would say a defined value within a trusted internal script or one supplied by a trusted internal source would be fine. But if, as Colin mentioned, there is any external input, then that is much different and additional precautions should be taken.
This question already has answers here:
How do you escape strings for SQLite table/column names in Python?
(8 answers)
Closed 7 years ago.
I have a wide table in a sqlite3 database, and I wish to dynamically query certain columns in a Python script. I know that it's bad to inject parameters by string concatenation, so I tried to use parameter substitution instead.
I find that, when I use parameter substitution to supply a column name, I get unexpected results. A minimal example:
import sqlite3 as lite
db = lite.connect("mre.sqlite")
c = db.cursor()
# Insert some dummy rows
c.execute("CREATE TABLE trouble (value real)")
c.execute("INSERT INTO trouble (value) VALUES (2)")
c.execute("INSERT INTO trouble (value) VALUES (4)")
db.commit()
for row in c.execute("SELECT AVG(value) FROM trouble"):
print row # Returns 3
for row in c.execute("SELECT AVG(:name) FROM trouble", {"name" : "value"}):
print row # Returns 0
db.close()
Is there a better way to accomplish this than simply injecting a column name into a string and running it?
As Rob just indicated in his comment, there was a related SO post that contains my answer. These substitution constructions are called "placeholders," which is why I did not find the answer on SO initially. There is no placeholder pattern for column names, because dynamically specifying columns is not a code safety issue:
It comes down to what "safe" means. The conventional wisdom is that
using normal python string manipulation to put values into your
queries is not "safe". This is because there are all sorts of things
that can go wrong if you do that, and such data very often comes from
the user and is not in your control. You need a 100% reliable way of
escaping these values properly so that a user cannot inject SQL in a
data value and have the database execute it. So the library writers do
this job; you never should.
If, however, you're writing generic helper code to operate on things
in databases, then these considerations don't apply as much. You are
implicitly giving anyone who can call such code access to everything
in the database; that's the point of the helper code. So now the
safety concern is making sure that user-generated data can never be
used in such code. This is a general security issue in coding, and is
just the same problem as blindly execing a user-input string. It's a
distinct issue from inserting values into your queries, because there
you want to be able to safely handle user-input data.
So, the solution is that there is no problem in the first place: inject the values using string formatting, be happy, and move on with your life.
Why not use string formatting?
for row in c.execute("SELECT AVG({name}) FROM trouble".format(**{"name" : "value"})):
print row # => (3.0,)
I have wrote a query which has some string replacements. I am trying to update a url in a table but the url has % signs in which causes a tuple index out of range exception.
If I print the query and run in manually it works fine but through peewee causes an issue. How can I get round this? I'm guessing this is because the percentage signs?
query = """
update table
set url = '%s'
where id = 1
""" % 'www.example.com?colour=Black%26white'
db.execute_sql(query)
The code you are currently sharing is incredibly unsafe, probably for the same reason as is causing your bug. Please do not use it in production, or you will be hacked.
Generally: you practically never want to use normal string operations like %, +, or .format() to construct a SQL query. Rather, you should to use your SQL API/ORM's specific built-in methods for providing dynamic values for a query. In your case of SQLite in peewee, that looks like this:
query = """
update table
set url = ?
where id = 1
"""
values = ('www.example.com?colour=Black%26white',)
db.execute_sql(query, values)
The database engine will automatically take care of any special characters in your data, so you don't need to worry about them. If you ever find yourself encountering issues with special characters in your data, it is a very strong warning sign that some kind of security issue exists.
This is mentioned in the Security and SQL Injection section of peewee's docs.
Wtf are you doing? Peewee supports updates.
Table.update(url=new_url).where(Table.id == some_id).execute()
I'm to link my code to a MySQL database using pymysql. In general everything has gone smoothly but I'm having difficulty with the following function to find the minimum of a variable column.
def findmin(column):
cur = db.cursor()
sql = "SELECT MIN(%s) FROM table"
cur.execute(sql,column)
mintup = cur.fetchone()
if everything went smoothly this would return me a tuple with the minimum, e.g. (1,).
However, if I run the function:
findmin(column_name)
I have to put column name in "" (i.e. "column_name"), else Python sees it as an unknown variable. But if I put the quotation marks around column_name then SQL sees
SELECT MIN("column_name") FROM table
which just returns the column header, not the value.
How can I get around this?
The issue is likely the use of %s for the column name. That means the SQL Driver will try to escape that variable when interpolating it, including quoting, which is not what you want for things like column names, table names, etc.
When using a value in SELECT, WHERE, etc. then you do want to use %s to prevent SQL injections and enable quoting, among other things.
Here, you just want to interpolate using pure Python (assuming a trusted value; please see below for more information). That also means no bindings tuple passed to the execute method.
def findmin(column):
cur = db.cursor()
sql = "SELECT MIN({0}) FROM table".format(column)
cur.execute(sql)
mintup = cur.fetchone()
SQL fiddle showing the SQL working:
http://sqlfiddle.com/#!2/e70a41/1
In response to the Jul 15, 2014 comment from Colin Phipps (September 2022):
The relatively recent edit on this post by another community member brought it to my attention, and I wanted to respond to Colin's comment from many years ago.
I totally agree re: being careful about one's input if one interpolates like this. Certainly one needs to know exactly what is being interpolated. In this case, I would say a defined value within a trusted internal script or one supplied by a trusted internal source would be fine. But if, as Colin mentioned, there is any external input, then that is much different and additional precautions should be taken.
I used MySQL Connector/Python API, NOT MySQLdb.
I need to dynamically insert values into a sparse table so I wrote the Python code like this:
cur.executemany("UPDATE myTABLE SET %s=%s WHERE id=%s" % data)
where
data=[('Depth', '17.5cm', Decimal('3003')), ('Input_Voltage', '110 V AC', Decimal('3004'))]
But it resulted an error:
TypeError: not enough arguments for format string
Is there any solution for this problem? Is it possible to use executemany when there is a
substitution of a field in query?
Thanks.
Let's start with the original method:
As the error message suggests you have a problem with your SQL syntax (not Python). If you insert your values you are effectively trying to execute
UPDATE myTABLE SET 'Depth'='17.5cm' WHERE id='3003'
You should notice that you are trying to assign a value to a string 'Depth', not a database field. The reason for this is that the %s substitution of the mysql module is only possible for values, not for tables/fields or other object identifiers.
In the second try you are not using the substitution anymore. Instead you use generic python string interpolation, which however looks similar. This does not work for you because you have a , and a pair of brackets too much in your code. It should read:
cur.execute("UPDATE myTABLE SET %s=%s WHERE id=%s" % data)
I also replaced executemany with execute because this method will work only for a single row. However your example only has one row, so there is no need to use executemany anyway.
The second method has some drawbacks however. The substitution is not guaranteed to be quoted or formatted in a correct manner for the SQL query, which might cause unexpected behaviour for certain inputs and may be a security concern.
I would rather ask, why it is necessary to provide the field name dynamically in the first place. This should not be necessary and might cause some trouble.