How can I implement an efficient search of a large SQLite db (more than 90,000 entries)?
I'm using Python and SQLObject ORM:
import re
...
def search1():
    cr = re.compile(ur'foo')
    for item in Item.select():
        if cr.search(item.name) or cr.search(item.skim):
            print item.name
This function runs in more than 30 seconds. How should I make it run faster?
UPD: The test:
for item in Item.select():
pass
... takes almost the same time as my initial function (0:00:33.093141 to 0:00:33.322414). So the regexps eat no time.
A Sqlite3 shell query:
select '' from item where name like '%foo%';
runs in about a second. So the main cost is the ORM's inefficient retrieval of data from the db. I guess SQLObject grabs entire rows here, while SQLite touches only the necessary fields.
The best way would be to rework your logic to do the selection in the database instead of in your python program.
Instead of doing Item.select(), you should rework it to do Item.select("""name LIKE ....
If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
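For illustration, a rough sketch of the question's search1() with the filter pushed into SQLite (assuming SQLObject's raw-string WHERE form; column names taken from the question):
def search1():
    # the LIKE match now runs inside SQLite instead of regex-scanning every row in Python
    for item in Item.select("name LIKE '%foo%' OR skim LIKE '%foo%'"):
        print item.name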
30 seconds to fetch 90,000 rows might not be all that bad.
Have you benchmarked the time required to do the following?
for item in Item.select():
    pass
Just to see if the time is DB time, network time or application time?
If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.
If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.
The best thing would be to write an SQLite function that performs the comparison for you inside the db engine, instead of in Python.
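A hedged sketch of that idea using the standard sqlite3 module directly (db path and table name taken from the question's context and may need adjusting): SQLite's REGEXP operator calls a user-defined regexp() function, so registering one from Python lets the engine do the row filtering and hand back only the matches.
import re
import sqlite3

def regexp(pattern, value):
    # called by SQLite for each row; must return a truthy value on a match
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect("items.db")            # db path is an assumption
conn.create_function("regexp", 2, regexp)
rows = conn.execute(
    "SELECT name FROM item WHERE name REGEXP ? OR skim REGEXP ?",
    ("foo", "foo"),
).fetchall()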
You could also switch to a db server like PostgreSQL, which has support for SIMILAR TO.
http://www.postgresql.org/docs/8.3/static/functions-matching.html
I would definitely take Reed's suggestion to push the filter into the SQL (forget the index part, though).
I do not think that selecting only specified fields versus all fields makes much of a difference (unless you have a lot of large fields). I would bet that SQLObject creates/instantiates 90K objects and puts them into a Session/UnitOfWork for tracking. This could definitely take some time.
Also, if you do not need the objects in your session, there must be a way to select just the fields you need using custom query creation, so that no Item objects are created, only tuples.
Doing the regex via Python was initially considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster). GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
See the Endnotes at http://yserial.sourceforge.net/ for more details.
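A minimal sketch of the GLOB approach through the sqlite3 module (db path is an assumption; note that GLOB is case-sensitive, unlike LIKE):
import sqlite3

conn = sqlite3.connect("items.db")   # db path is an assumption
# * and ? are the GLOB wildcards, so '*foo*' matches 'foo' anywhere in the value
rows = conn.execute(
    "SELECT name FROM item WHERE name GLOB ? OR skim GLOB ?",
    ("*foo*", "*foo*"),
).fetchall()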
Given your example, and expanding on Reed's answer, your code could look a bit like the following:
import re
import sqlalchemy.sql.expression as expr
...
def search1():
    searchStr = ur'foo'
    whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr),
                           itemsTable.c.skimColumn.contains(searchStr))
    for item in Items.select().where(whereClause):
        print item.name
which translates to
SELECT * FROM items WHERE name LIKE '%foo%' or skim LIKE '%foo%'
This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.
You can find some info on the .contains() method here, as well as the SQLAlchemy SQL Expression Language Tutorial here.
Of course, the example above assumes variable names for your itemsTable and the columns it has (nameColumn and skimColumn).
Related
I have noticed a huge timing difference between using django connection.cursor vs using the model interface, even with small querysets.
I have made the model interface as efficient as possible, with values_list so no objects are constructed, and so on. Below are the two functions tested; don't mind the Spanish names.
def t3():
    q = "select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000"
    with connection.cursor() as c:
        c.execute(q)
        return list(c)
def t4():
    return list(EventoAgendado.objects.all().values_list('id', 'numerosDisponibles')[:1000])
Then, using a self-made timing function (based on time.clock()):
r1 = timeme(t3); r2 = timeme(t4)
The results are as follows:
0.00180384529631 and 0.00493390727024 for t3 and t4
And just to make sure the queries are the same and take the same time to execute:
connection.queries[-2::]
Yields:
[
{u'sql': u'select id, numerosDisponibles FROM samibackend_eventoagendado LIMIT 1000', u'time': u'0.002'},
{u'sql': u'SELECT `samiBackend_eventoagendado`.`id`, `samiBackend_eventoagendado`.`numerosDisponibles` FROM `samiBackend_eventoagendado` LIMIT 1000', u'time': u'0.002'}
]
As you can see, these are two identical queries returning two identical lists (r1 == r2 returns True), yet the timings are totally different (and the difference grows with a bigger query set). I know Python is slow, but is Django doing so much work behind the scenes that it makes the query that much slower?
Also, just to make sure, I have tried building the queryset object first (outside the timer) but results are the same, so I'm 100% sure the extra time comes from fetching and building the result structure.
I have also tried using the iterator() function at the end of the query, but that doesn't help either.
I know the difference is minimal and both execute blazingly fast, but this is being benchmarked with Apache ab, and with 1k concurrent requests this minimal difference becomes night and day.
By the way, I'm using Django 1.7.10 with mysqlclient as the db connector.
EDIT: For the sake of comparison, the same test with an 11k-result query set makes the difference even bigger (3x slower, compared to around 2.6x slower in the first test):
r1 = timeme(t3); r2 = timeme(t4)
0.0149241530889
0.0437563529558
EDIT2: Another funny test: if I convert the queryset object to its actual string query (with str(queryset.query)) and use that in a raw query instead, I get the same good performance as the raw query. The exception is that the queryset.query string sometimes gives me an actually invalid SQL query (e.g., if the queryset has a filter on a date value, the date value is not quoted in the string query, giving an SQL error when executing it as a raw query; this is another mystery).
-- EDIT3:
Going through the code, it seems the difference comes from how the result data is retrieved. A raw query set simply calls iter(self.cursor), which I believe will run entirely in C code when using a C-implemented connector (iter is also a built-in), while ValuesListQuerySet is actually a Python-level for loop with a yield tuple(row) statement, which will be quite slow. I guess there's nothing to be done here to get the same performance as the raw query set :'(.
If anyone is interested, the slow loop is this one:
for row in self.query.get_compiler(self.db).results_iter():
    yield tuple(row)
-- EDIT 4: I have come up with some very hacky code to convert a values-list query set into data that can be sent to a raw query, giving the same performance as running a raw query. I guess this is very bad and will only work with MySQL, but the speed-up is very nice while still letting me keep the model API for filtering and such. What do you think?
Here's the code.
def querysetAsRaw(qs):
    # compile the queryset into its SQL string plus parameters
    q = qs.query.get_compiler(qs.db).as_sql()
    with connection.cursor() as c:
        c.execute(q[0], q[1])
        return list(c)   # fetch the rows before the cursor context closes
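For reference, hypothetical usage of that helper (model and field names taken from the earlier example):
qs = EventoAgendado.objects.values_list('id', 'numerosDisponibles')[:1000]
rows = querysetAsRaw(qs)   # same rows as the queryset, fetched via the raw cursor path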
The answer was simple: update to Django 1.8 or above, which changed the relevant code so this performance issue no longer occurs.
I have an SQLite database of members of an organisation (fewer than 200 people). Now I'm trying to write a wx app that will search the database and return some contact information into a wx.grid. The app will have two TextCtrls, one for the first name and one for the last name. What I want is to be able to type just one or a few letters into the TextCtrls and have results start coming back. So, if I'm searching for "John Smith", I type "Jo" in the first TextCtrl and that returns every John (or anyone else whose name starts with those letters). There will be no "search" button; instead, it will start searching whenever I press a key.
One way to solve this would be to query the database with something like SELECT * FROM contactlistview WHERE forname LIKE 'Jo%'. But that seems like a bad idea (very database-heavy to run for every keystroke?). Instead I thought of using fetchall() on a query like SELECT * FROM contactlistview and then, for every keystroke, searching the list of tuples that the query returned. And that is my problem: searching a list is not that difficult, but how can I search a list of tuples with wildcards?
selected = [t for t in all_data if t[1].startswith('Jo')]
But measure, don't guess. I think that in some cases the query would be faster, especially if you have a lot of records. Maybe you should run a query on the first character, and then switch to Python-side filtering from there, since you already have the results.
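If you want real wildcard patterns rather than just a prefix on the Python side, the fnmatch module gives you shell-style * and ? matching; a minimal sketch (assuming, as above, that the first name sits at index 1 of each tuple):
import fnmatch

def filter_rows(all_data, pattern):
    # lowercase both sides and append '*' so 'Jo' behaves as a prefix pattern
    return [t for t in all_data if fnmatch.fnmatch(t[1].lower(), pattern.lower() + '*')]

selected = filter_rows(all_data, 'Jo')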
I think that generally, you shouldn't be afraid of giving tasks to a database. It's quite possible that the LIKE clause will be very fast. Sqlite is implemented in fairly robust C code, and will happily deal with queries like this.
If you're worried about sending too many requests, why not send a query once a user has entered a threshold of characters, such as three?
A list comprehension is probably the best way to return the result if you want to do added filtering.
If you are searching for a string matching the start using LIKE, eg 'abc%' (rather than anywhere in the string - '%abc%'), the search should be quite fast if you have an index on the field, as the db can use the index to help find the matches.
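For illustration, a parameterized prefix query of that kind might look like this (a sketch assuming direct sqlite3 access through an open connection conn):
prefix = "Jo"
# the wildcard is only at the end, so an index on forname can help
rows = conn.execute(
    "SELECT * FROM contactlistview WHERE forname LIKE ?",
    (prefix + "%",),
).fetchall()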
Currently I'm using SQLAlchemy as an ORM, and I'm looking for a way to speed up my insert operation; I have a bundle of XML files to import.
for name in names:
    p = Product()
    p.name = "xxx"
    session.add(p)
    session.commit()   # committing once per row
I use the above code to insert the data parsed from the batch XML files into MySQL; it's very slow.
I also tried:
for name in names:
    p = Product()
    p.name = "xxx"
    session.add(p)
session.commit()   # single commit after the loop
but it didn't seem to change anything.
You could bypass the ORM for the insertion operation and use the SQL Expression generator instead.
Something like:
conn.execute(Product.insert(), [dict(name=name) for name in names])
That should create a single statement to do your inserting.
That example was taken from lower down the same page.
(I'd be interested to know what speedup you got from that)
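To make that snippet a bit more self-contained, here is a hedged sketch assuming Product is a declarative model and engine is an already-configured SQLAlchemy engine; the Core insert is executed once with a list of parameter dicts, inside a single transaction:
with engine.begin() as conn:               # commits on success, rolls back on error
    conn.execute(
        Product.__table__.insert(),
        [{"name": name} for name in names],
    )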
I'm looking for the most efficient way to bulk-insert some millions of tuples into a database. I'm using Python, PostgreSQL and psycopg2.
I have created a long list of tuples that should be inserted into the database, sometimes with modifiers like a geometric Simplify.
The naive way to do it would be string-formatting a list of INSERT statements, but there are three other methods I've read about:
Using pyformat binding style for parametric insertion
Using executemany on the list of tuples, and
Writing the results to a file and using COPY.
It seems that the first way is the most efficient, but I would appreciate your insights and code snippets telling me how to do it right.
Yeah, I would vote for COPY, provided you can write a file to the server's hard drive (not the drive the app is running on), as COPY will only read off the server.
There is a new psycopg2 manual containing examples for all the options.
The COPY option is the most efficient. Then the executemany. Then the execute with pyformat.
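For what it's worth, psycopg2 can also do COPY ... FROM STDIN via copy_from, which streams from the client and doesn't need a file on the server; a minimal sketch (table and column names are placeholders, data is the list of tuples, cur and conn are an open cursor and connection):
import io

buf = io.StringIO()
for a, b in data:
    buf.write("%s\t%s\n" % (a, b))     # tab-separated rows, one per line
buf.seek(0)
cur.copy_from(buf, "table1", columns=("id", "var1"))
conn.commit()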
In my experience, executemany is not any faster than running many inserts yourself; the fastest way is to format a single INSERT with many VALUES yourself. Maybe executemany will improve in the future, but for now it is quite slow.
I subclass list and overload the append method, so when the list reaches a certain size I format the INSERT and run it.
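A rough sketch of that approach with psycopg2, building one INSERT with many VALUES tuples via mogrify (table and column names are placeholders; rows is the list of parameter tuples):
values = ",".join(
    cur.mogrify("(%s, %s)", row).decode() for row in rows   # safely quoted value tuples
)
cur.execute("INSERT INTO table1 (id, var1) VALUES " + values)
conn.commit()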
You could use a new upsert library:
$ pip install upsert
(you may have to pip install decorator first)
import psycopg2
from upsert import Upsert   # assuming the library exposes Upsert at the top level

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)
Where selector is a dict object like {'name': 'Chris Smith'} and setter is a dict like { 'age': 28, 'state': 'WI' }
It's almost as fast as writing custom INSERT[/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.
The newest way of inserting many items is using the execute_values helper (https://www.psycopg.org/docs/extras.html#fast-execution-helpers).
from psycopg2.extras import execute_values
insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# this is optional
value_template="(%s, %s, to_timestamp(%s))"
cur = conn.cursor()
items = []
items.append((1, "name", 123123))
# append more...
execute_values(cur, insert_sql, items, value_template)
conn.commit()
Anyone using SQLAlchemy could try version 1.2, which added support for bulk insert using psycopg2.extras.execute_batch() instead of executemany when you initialize your engine with use_batch_mode=True, like:
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)
http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109
Then anyone who has to use SQLAlchemy won't need to bother trying different combinations of SQLAlchemy, psycopg2, and direct SQL together.
After some testing, unnest often seems to be an extremely fast option, as I learned from @Clodoaldo Neto's answer to a similar question.
data = [(1, 100), (2, 200), ...] # list of tuples
cur.execute("""CREATE TABLE table1 AS
SELECT u.id, u.var1
FROM unnest(%s) u(id INT, var1 INT)""", (data,))
However, it can be tricky with extremely large data.
The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.
A very related question: Bulk insert with SQLAlchemy ORM
All roads lead to Rome, but some of them cross mountains or require ferries. If you want to get there quickly, just take the motorway.
In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:
The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().
In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine()
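A minimal sketch of execute_batch itself (table and column names are placeholders; data is a list of parameter tuples, cur and conn an open cursor and connection):
from psycopg2.extras import execute_batch

execute_batch(
    cur,
    "INSERT INTO table1 (id, var1) VALUES (%s, %s)",
    data,
    page_size=1000,        # how many statements are sent per server round trip
)
conn.commit()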
I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples to the table (assume I verified neither of these strings appear in the database). For clarification, I'm using InnoDB, and I have an auto-incremental primary key for this table, the string is not the PK.
My code currently iterates through this list and, for each tuple, creates a Django model object with the appropriate values and calls .save(), something like this:
@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL statements, each containing an INSERT command for 100 tuples ("hard-coded" into the SQL), and execute them, but I don't know if that would improve anything.
Do you have any suggestion to optimize such a process?
Thanks
You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
data = '\n'.join(','.join('"%s"' % field for field in row) for row in data)

with open('data.txt', 'w') as f:
    f.write(data)
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
Reference
For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you could convert the data into the format LOAD DATA expects, it'll probably be the fastest way to get it into the table.
If you don't use LOAD DATA INFILE, as some of the other suggestions mention, two things you can do to speed up your inserts are:
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)
If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much much faster than lots of individual INSERT statements.
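A hedged sketch of that idea with Django's raw cursor, assuming a recent Django (transaction.atomic) and an illustrative table name; each chunk becomes one multi-row INSERT, and everything runs inside a single transaction:
from django.db import connection, transaction

def bulk_insert(rows, chunk_size=100):
    # table name myapp_dataelement and the chunk size are assumptions
    sql = "INSERT INTO myapp_dataelement (string, number1, number2) VALUES "
    with transaction.atomic():                     # one transaction for all chunks
        with connection.cursor() as cur:
            for start in range(0, len(rows), chunk_size):
                chunk = rows[start:start + chunk_size]
                placeholders = ", ".join(["(%s, %s, %s)"] * len(chunk))
                params = [value for row in chunk for value in row]
                cur.execute(sql + placeholders, params)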
Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.
I don't know the exact details, but you can use a JSON-style data representation and use it as fixtures or something. I saw something similar in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.
Hope you can work it out. I've just started learning Django, so I can only point you to resources.