I'm using web.py to create a simple report page from Oracle. When I take the best practice approach of using vars= to pass parameters, the delay is 11-12 seconds. When I do the same query using string substitution, the query runs in less than a second. Here's how I'm checking:
sql = """
SELECT a, b, c
FROM my_table
WHERE what = $what
ORDER BY a, b"""
print('before', datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
result = db.query(sql, vars={'what': '1234'})
print('after', datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
The "before" and "after" show clearly than I'm getting a massive delay on the query. I've tried using select() with the same vars= and I get the same delay. So, my initial suggestion is that it's web.db's SQL escape functions that are creating the delay. I don't want to pass unescaped input, and it doesn't seem like there should be so much overhead.
Is there anything else that could be creating this delay? If it is the escape logic, are there any gotchas of which I need to be aware?
Thanks in advance!
EDIT:
On further investigation, I've proven (to myself, at least) that the delay is not specific to web.py but happens in cx_Oracle. I reached this conclusion by adjusting my SQL syntax and doing:
cursor = db._db_cursor()
lines = cursor.execute(sql.format(my_table, {'what': '1234'}))
...this produces a similar ten-to-twelve second delay compared to hard-coding the variable. Any cx_Oracle advice?
Alrighty. On further, further investigation, I figured out that the problem is a mismatch in encoding between Python and Oracle for the string parameter I pass. Fixed it with a simple what.encode('iso-8859-1'). If that doesn't work for you, check the Oracle encoding using the PL/SQL dump() function. If the Python encoding doesn't work, try decoding first.
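For reference, a minimal sketch of the fix at the cx_Oracle level, assuming the SQL has been rewritten with cx_Oracle's :what bind syntax and that the cursor still comes from web.py's db._db_cursor() as above:
sql = """
    SELECT a, b, c
    FROM my_table
    WHERE what = :what
    ORDER BY a, b"""

what = '1234'
cursor = db._db_cursor()
# Encode the bind value to match the database character set; check the actual
# column encoding with PL/SQL's dump() if iso-8859-1 isn't right for you.
cursor.execute(sql, {'what': what.encode('iso-8859-1')})
rows = cursor.fetchall()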
I'm currently reviewing someone's code, and I ran into the following Python line:
db.query('''SELECT foo FROM bar WHERE id = %r''' % id)
This goes against my common sense, because I would usually opt to use prepared statements, or at the very least the database system's native string-escaping function.
However, I am still curious how this could be exploited, given that:
The 'id' value is a string or number that's provided by an end-user/pentester
This is MySQL
The connection is explicitly set to use UTF8.
Python drivers for MySQL don't support real prepared statements. They all do some form of string-interpolation. The trick is to get Python to do the string-interpolation with proper escaping.
See a demonstration of doing it unsafely: How do PyMySQL prevent user from sql injection attack?
The conventional solution to simulate parameters is the following:
sql = "SELECT foo FROM bar WHERE id = %s"
cursor.execute(sql, (id,))
See https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-execute.html
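For contrast, a side-by-side sketch of the two styles (bar and id are from the snippet under review; the first form is a fragment shown only to illustrate what to avoid):
# Unsafe: %r interpolates repr(id) directly into the SQL text,
# so the driver never gets a chance to escape the value.
cursor.execute("SELECT foo FROM bar WHERE id = %r" % id)

# Safe: the value is passed separately and escaped by the driver.
cursor.execute("SELECT foo FROM bar WHERE id = %s", (id,))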
The only ways I know to overcome escaping (when it is done correctly) are:
Exploit GBK, SJIS, or similar character sets, where an escaped quote becomes part of a multi-byte character. By making sure to SET NAMES utf8, you should be safe from this issue.
Change the sql_mode to break the escaping, for example by enabling NO_BACKSLASH_ESCAPES or ANSI_QUOTES. You should set sql_mode at the start of your session, just as you SET NAMES (see the sketch below), so you aren't relying on a globally changed sql_mode that could cause a problem.
See also Is "mysqli_real_escape_string" enough to avoid SQL injection or other SQL attacks?
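To make the session setup concrete, here is a hedged sketch using PyMySQL (the connection details are placeholders; the same idea applies to other drivers):
import pymysql

# charset='utf8mb4' has the same effect as SET NAMES, guarding against the
# GBK/SJIS multi-byte trick mentioned above.
conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="mydb", charset="utf8mb4")
user_supplied_id = "1234"  # placeholder for end-user input

with conn.cursor() as cur:
    # Pin sql_mode for this session so a globally changed mode
    # (NO_BACKSLASH_ESCAPES, ANSI_QUOTES) can't break the driver's escaping.
    cur.execute("SET SESSION sql_mode = 'STRICT_TRANS_TABLES'")
    cur.execute("SELECT foo FROM bar WHERE id = %s", (user_supplied_id,))
    rows = cur.fetchall()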
I have to carry out some statistical treatments on data that is stored in PostgreSQL tables. I have been hesitating between using R and Python.
With R I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;")
which is very fast: it takes only about 5 seconds to fetch a table with ~200,000 rows and 15 columns and almost no NULLs in it.
With Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;")
my_table = cursor.fetchall()
and surprisingly, it causes my Python session to freeze and my computer to crash.
As I use these libraries as "black boxes", I don't understand why something that is so quick in R can be so slow (and thus almost unusable in practice) in Python.
Can someone explain this difference in performance, and can someone tell if there exists a more efficient method to fetch a pgSQL table in Python?
I am no expert in R, but quite obviously what dbGetQuery() (actually, what dbFetch()) returns is a lazy object that does not load all results into memory; otherwise it would of course take ages and eat all your RAM too.
With Python/psycopg2, you definitely don't want to fetchall() a huge dataset. The proper solution here is to use a server-side cursor and iterate over it.
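A minimal sketch of that, reusing the conn_string from the question (process() is a placeholder for whatever you do with each row):
import psycopg2

conn = psycopg2.connect(conn_string)
# Giving the cursor a name makes psycopg2 create a server-side cursor.
cursor = conn.cursor('my_named_cursor')
cursor.itersize = 2000  # rows fetched per round trip to the server
cursor.execute("SELECT * FROM some_table;")
for row in cursor:      # rows stream in batches instead of being loaded at once
    process(row)
cursor.close()
conn.close()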
Edit - answering the questions in your comments:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As spelled out in the example I linked to, what "does the trick" is using a server-side cursor, which is done (in psycopg2) by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor psycopg2 creates a server-side cursor, which prevents all of the records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
The DictCursor stuff is actually irrelevant (and should not have been mentioned in that example, since it obviously confuses newcomers).
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I don't know zilch about R and its implementation - I deduce that whatever dbFetch() returns is a lazy object from the behaviour you describe - but there's nothing magical about an object that lazily fetches values from an external source. Python's file object is a well-known example:
with open("/some/huge/file.txt") as f:
for line in f:
print line
In the above snippet, the file object f fetches data from disk only when needed. All that needs to be stored is the file pointer position (and a buffer of the last N bytes that were read from disk, but that's an implementation detail).
If you want to learn more, read about Python's iterable and iterator protocols.
I have noticed a huge timing difference between using django connection.cursor vs using the model interface, even with small querysets.
I have made the model interface as efficient as possible, using values_list so no model objects are constructed, and so on. Below are the two functions tested (don't mind the Spanish names):
from django.db import connection  # EventoAgendado is the app's model; its import is omitted here as in the question

def t3():
    # Raw SQL through the connection cursor
    q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
    with connection.cursor() as c:
        c.execute(q)
        return list(c)

def t4():
    # Same data through the ORM, with values_list so no model instances are built
    return list(EventoAgendado.objects.all().values_list('id', 'numerosDisponibles')[:1000])
Then, timing them with a self-made helper based on time.clock():
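The helper isn't shown in the question; presumably it looks something like this hypothetical sketch (it has to return the function's result, since r1 == r2 is compared later):
import time

def timeme(func):
    # Hypothetical helper: prints the elapsed time measured with time.clock()
    # and returns the function's result so the two result lists can be compared.
    start = time.clock()
    result = func()
    print(time.clock() - start)
    return result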
r1 = timeme(t3); r2 = timeme(t4)
The results are as follows:
0.00180384529631 and 0.00493390727024 for t3 and t4
And just to make sure the queries are the same and take the same time to execute:
connection.queries[-2::]
Yields:
[
{u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000', u'time': u'0.002'},
{u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]
As you can see, two identical queries, returning two identical lists (r1 == r2 returns True), take totally different timings (the difference gets bigger with a bigger query set). I know Python is slow, but is Django doing so much work behind the scenes that it makes the query that much slower?
Also, just to make sure, I have tried building the queryset object first (outside the timer) but results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure.
I have also tried using the iterator() function at the end of the query, but that doesn't help either.
I know the difference is minimal and both execute blazingly fast, but this is being benchmarked with apache ab, and with 1k concurrent requests this minimal difference makes a night-and-day difference.
By the way, I'm using django 1.7.10 with mysqlclient as the db connector.
EDIT: For the sake of comparison, with an 11k-row result set the difference gets even bigger (3x slower, compared to around 2.6x slower in the first test):
r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558
EDIT2: Another funny test: if I convert the queryset object to its actual SQL string (with str(queryset.query)) and use that in a raw query instead, I get the same good performance as the raw query, except that the queryset.query string sometimes gives me invalid SQL (e.g. if the queryset filters on a date value, the date is not quoted with '' in the string, causing an SQL error when executing it as a raw query; that's another mystery).
-- EDIT3:
Going through the code, it seems the difference comes from how the result data is retrieved. A raw queryset simply calls iter(self.cursor), which I believe will run entirely in C code when using a C-implemented connector (iter is also a built-in), while ValuesListQuerySet is actually a Python-level for loop with a yield tuple(row) statement, which will be quite slow. I guess there's nothing to be done here to get the same performance as the raw queryset :'(.
If anyone is interested, the slow loop is this one:
for row in self.query.get_compiler(self.db).results_iter():
yield tuple(row)
-- EDIT 4: I have come up with some very hacky code to convert a values-list queryset into data usable by a raw query, with the same performance as running the raw query directly. I guess this is very bad and will only work with MySQL, but the speed-up is very nice while still letting me keep the model API for filtering and such. What do you think?
Here's the code.
def querysetAsRaw(qs):
    # Compile the queryset to (sql, params) and run it through a raw cursor.
    q = qs.query.get_compiler(qs.db).as_sql()
    with connection.cursor() as c:
        c.execute(q[0], q[1])
        return list(c)  # fetch before the cursor is closed on leaving the with block
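For illustration, a hypothetical call site, reusing the model and columns from the question:
# Keep the ORM's filtering API, but fetch through the raw-cursor path.
qs = EventoAgendado.objects.all().values_list('id', 'numerosDisponibles')[:1000]
rows = querysetAsRaw(qs)  # same rows as list(qs), fetched like the raw query in t3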
The answer was simple: update to Django 1.8 or above, which changed the relevant code so that this performance issue no longer exists.
It should be simple, but I've spent the last hour searching for the answer. This is using psycopg2 on Python 2.6.
I need something like this:
special_id = 5
sql = """
select count(*) as ct
from some_table tbl
where tbl.id = %(the_id)
"""
cursor = connection.cursor()
cursor.execute(sql, {"the_id" : special_id})
I cannot get this to work. Were special_id a string, I could replace %(the_id) with %(the_id)s and things would work fine. However, I want it to use the integer so that it hits my indexes correctly.
There is a surprising lack of specific information on psycopg2 on the internet. I hope someone has an answer to this seemingly simple question.
Per PEP 249, since in psycopg2 paramstyle is pyformat, you need to use %(the_id)s even for non-strings -- trust it to do the right thing.
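So the snippet from the question becomes (a minimal corrected sketch; psycopg2 adapts the Python int, and the comparison stays an integer one so the index is still usable):
special_id = 5
sql = """
    select count(*) as ct
    from some_table tbl
    where tbl.id = %(the_id)s
"""
cursor = connection.cursor()
cursor.execute(sql, {"the_id": special_id})
ct = cursor.fetchone()[0]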
BTW, internet searches will work better if you use the correct spelling (no h there), but even if you mis-spelled, I'm surprised you didn't get a "did you mean" hint (I did when I deliberately tried!).
How is it possible to implement an efficient search of a large SQLite db (more than 90,000 entries)?
I'm using Python and SQLObject ORM:
import re
...
def search1():
    cr = re.compile(ur'foo')
    # Fetches every row and applies the regex in Python
    for item in Item.select():
        if cr.search(item.name) or cr.search(item.skim):
            print item.name
This function runs in more than 30 seconds. How should I make it run faster?
UPD: The test:
for item in Item.select():
pass
... takes almost the same time as my initial function (0:00:33.093141 vs 0:00:33.322414), so the regexps add essentially no time.
An SQLite3 shell query:
select '' from item where name like '%foo%';
runs in about a second. So the bulk of the time is spent in the ORM's inefficient retrieval of data from the db. I guess SQLObject fetches entire rows here, while SQLite touches only the necessary fields.
The best way would be to rework your logic to do the selection in the database instead of in your Python program.
Instead of doing Item.select(), you should rework it to do Item.select("""name LIKE ....
If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
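A rough sketch of that using SQLObject's sqlbuilder helpers (the column names name and skim come from the question; the search term and pattern are illustrative):
from sqlobject.sqlbuilder import LIKE, OR

def search_in_db(term):
    # Build the filter so SQLite does the matching instead of Python.
    pattern = '%' + term + '%'
    where = OR(LIKE(Item.q.name, pattern), LIKE(Item.q.skim, pattern))
    for item in Item.select(where):
        print item.name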
30 seconds to fetch 90,000 rows might not be all that bad.
Have you benchmarked the time required to do the following?
for item in Item.select():
pass
Just to see if the time is DB time, network time or application time?
If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.
If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.
The best thing would be to write an SQLite function that performs the comparison for you inside the db engine, instead of in Python.
You could also switch to a db server like PostgreSQL, which supports SIMILAR TO.
http://www.postgresql.org/docs/8.3/static/functions-matching.html
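For the SQLite-function route, here is a sketch with the standard-library sqlite3 module (standalone, not wired through SQLObject; the database file name is a placeholder). The comparison function is still Python, but SQLite calls it per row during the scan, so only matching rows come back:
import re
import sqlite3

def regexp(pattern, value):
    # Called by SQLite for each row via the REGEXP operator.
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect("items.db")          # placeholder database file
conn.create_function("REGEXP", 2, regexp)   # enables the REGEXP operator
rows = conn.execute(
    "SELECT name FROM item WHERE name REGEXP ? OR skim REGEXP ?",
    ("foo", "foo"),
).fetchall()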
I would definitely take Reed's suggestion to push the filter into the SQL (forget the index part, though).
I do not think that selecting only specific fields rather than all fields makes much difference (unless you have a lot of large fields). I would bet that SQLObject creates/instantiates 80K objects and puts them into a Session/UnitOfWork for tracking. This could definitely take some time.
Also, if you do not need the objects in your session, there must be a way to select just the fields you need using custom query creation, so that no Item objects are created, only tuples.
Initially doing regex via Python was considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster). GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
See the Endnotes at http://yserial.sourceforge.net/ for more details.
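A tiny standalone illustration of the syntax difference (plain sqlite3, in memory):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT)")
conn.execute("INSERT INTO item VALUES ('foobar')")
# LIKE uses % and _ as wildcards; GLOB uses * and ? and is case-sensitive.
print(conn.execute("SELECT name FROM item WHERE name LIKE '%foo%'").fetchall())
print(conn.execute("SELECT name FROM item WHERE name GLOB '*foo*'").fetchall())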
Given your example and expanding on Reed's answer, your code could look a bit like the following (using SQLAlchemy's SQL expression language):
import sqlalchemy.sql.expression as expr
...
def search1():
    searchStr = u'foo'
    # contains() compiles to a LIKE '%foo%' filter on each column
    whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr),
                           itemsTable.c.skimColumn.contains(searchStr))
    # assumes itemsTable's metadata is bound to an engine, so the select can execute itself
    for item in itemsTable.select().where(whereClause).execute():
        print item.nameColumn
which translates to
SELECT * FROM items WHERE name LIKE '%foo%' or skim LIKE '%foo%'
This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.
You can find some info on the .contains() method here.
As well as the SQLAlchemy SQL Expression Language Tutorial here.
Of course, the example above assumes particular names for your itemsTable and its columns (nameColumn and skimColumn).