Sorry for the vague question, let me explain...
I have a list of words and counts in a database that has, no doubt, grown quite large: roughly an 80 MB database where each entry is two columns (word, integer).
Now, when I am trying to add a word, I check whether it is already in the database like this (inside a Python sqlite3 class method):
self.c.execute('SELECT * FROM {tn} WHERE {cn} = """{wn}"""'.format(tn=self.table1, cn=self.column1, wn=word_name))
exist = self.c.fetchall()
if exist:
    # do something
So you're checking for the existence of a word within a very large table of words? I think the short and simple answer to your question is to create an index for your word column.
The next step would be to set up a full database server (e.g. Postgres) instead of SQLite. SQLite doesn't have the optimization tweaks of a production database, and you'd likely see a performance gain after switching.
Even for a table with millions of rows, this shouldn't be a super time-intensive query if your table is properly indexed. If you already have an index and are still facing performance issues, there's something wrong with either your database setup/environment, or perhaps there's a bottleneck in your Python code or DB adapter. It's hard to say without more information.
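As a rough sketch of that first step (the database path and the word_counts/word names below are placeholders for whatever your schema actually uses), the index only has to be created once:
import sqlite3

# One-time setup: an index on the word column lets SQLite find a word with a
# B-tree lookup instead of scanning the whole table on every check.
conn = sqlite3.connect('words.db')  # hypothetical path to the database file
conn.execute('CREATE INDEX IF NOT EXISTS idx_word ON word_counts (word)')  # assumed table/column names
conn.commit()
conn.close()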
I would imagine that using COUNT within SQL would be faster:
self.c.execute('SELECT COUNT(*) FROM {tn} WHERE {cn} = """{wn}"""'.format(tn=self.table1, cn=self.column1, wn=word_name))
num = self.c.fetchone()[0]
if num:
    # do something
though I haven't tested it.
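For what it's worth, a parameterized existence check is another option (a sketch only, reusing the question's self.table1/self.column1 attributes; the ? placeholder is sqlite3's parameter style, so the word itself needs no manual quoting):
self.c.execute('SELECT 1 FROM {tn} WHERE {cn} = ? LIMIT 1'.format(tn=self.table1, cn=self.column1), (word_name,))
if self.c.fetchone():
    # the word already exists
    pass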
See How to check the existence of a row in SQLite with Python? for a similar question.
Related
What is the fastest way of checking whether a record exists, when I know the primary key? select, count, filter, where or something else?
When you use count, the database has to continue the search even if it found the record, because a second one might exist.
So you should search for the actual record, and tell the database to stop after the first one.
When you ask to return data from the record, then the database has to read that data from the table. But if the record can be found by looking up the ID in an index, that table access would be superfluous.
So you should return nothing but the ID you're using to search:
SELECT id FROM MyTable WHERE id = ? LIMIT 1;
Anyway, both the limit and not reading the actual data are implied when you use EXISTS, which peewee expresses even more simply:
SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?);
MyTable.select().where(MyTable.id == x).exists()
You can check yourself via EXPLAIN QUERY PLAN which will tell you the cost & what it intends to do for a particular query.
Costs don't directly compare between runs, but you should get a decent idea of whether there are any major differences.
That being said, I would expect SELECT COUNT(id) FROM table WHERE table.id = 'KEY' to be close to ideal, as it will take advantage of any partial lookup ability (particularly fast in columnar databases like Amazon Redshift) and the primary-key index.
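As a quick illustration of checking the plan from Python (a sketch; the database file, MyTable, and the literal id are placeholders):
import sqlite3

conn = sqlite3.connect('example.db')  # hypothetical database file
cur = conn.cursor()
# Ask SQLite how it intends to run the lookup; with a primary key or index in
# place you should see something like "SEARCH MyTable USING INDEX ...".
cur.execute('EXPLAIN QUERY PLAN SELECT id FROM MyTable WHERE id = ? LIMIT 1', (42,))
for row in cur.fetchall():
    print(row)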
First off, this is my first project using SQLAlchemy, so I'm still fairly new.
I am making a system to work with GTFS data. I have a back end that seems to be able to query the data quite efficiently.
What I am trying to do, though, is allow the GTFS files to update the database with new data. The problem that I am hitting is pretty obvious: if the data I'm trying to insert is already in the database, we have a conflict on the uniqueness of the primary keys.
For efficiency reasons, I decided to use the following code for insertions, where model is the model object I would like to insert the data into, and data is a precomputed, cleaned list of dictionaries to insert.
for chunk in [data[i:i+chunk_size] for i in xrange(0, len(data), chunk_size)]:
    engine.execute(model.__table__.insert(), chunk)
There are two solutions that come to mind.
1. I find a way to do the insert such that, if there is a collision, we don't care and don't fail. I believe the code above is using the TableClause, so I checked there first, hoping to find a suitable replacement or flag, with no luck.
2. Before we perform the cleaning of the data, we get the list of primary key values, and if a given element matches on the primary keys, we skip cleaning and inserting it. I found that I was able to get the PrimaryKeyConstraint from Table.primary_key, but I can't seem to get the Columns out, or find a way to query for only specific columns (in my case, the primary keys).
Either should be sufficient, if I can find a way to do it.
After looking into both of these for the last few hours, I can't seem to find either. I was hoping that someone might have done this previously, and point me in the right direction.
Thanks in advance for your help!
Update 1: There is a third option I failed to mention above: purge all the data from the database and reinsert it. I would prefer not to do this, as even with small GTFS files there are easily hundreds of thousands of elements to insert, and this seems to take about half an hour, which would mean a lot of downtime for updates if this makes it to production.
With SQLAlchemy, you simply create a new instance of the model class, and merge it into the current session. SQLAlchemy will detect if it already knows about this object (from cache or the database) and will add a new row to the database if needed.
for row in chunk:  # each row is a dict of column values
    newentry = model(**row)
    session.merge(newentry)
session.commit()
Also see this question for context: Fastest way to insert object if it doesn't exist with SQLAlchemy
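If you would rather pursue the second option from the question (skipping rows whose keys already exist), the primary-key columns are reachable through Table.primary_key.columns. A sketch, assuming a single-column primary key and a session bound to the same engine:
# Collect the primary-key columns of the mapped table.
pk_cols = list(model.__table__.primary_key.columns)

# Query only those columns to build the set of keys already in the database
# (assumes a single-column primary key for simplicity).
existing = {row[0] for row in session.query(*pk_cols).all()}

# Drop the rows that would collide before doing the bulk insert.
fresh = [row for row in data if row[pk_cols[0].name] not in existing]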
I am using SQLAlchemy. I want to efficiently delete all the records present in the database, but I don't want to drop the table/database.
I tried with the following code:
con = engine.connect()
trans = con.begin()
con.execute(table.delete())
trans.commit()
This does not seem very efficient, since I am iterating over all the tables present in the database.
Can someone suggest a better and more efficient way of doing this?
If your models rely on an existing DB schema (typically by using autoload=True), you cannot avoid deleting the data table by table. MetaData.sorted_tables comes in handy:
for tbl in reversed(meta.sorted_tables):
    engine.execute(tbl.delete())
If your models do define the complete schema, there is nothing simpler than drop_all/create_all (as already pointed out by @jadkik94).
Further, TRUNCATE would not work anyway on tables that are referenced by foreign keys, which limits its usefulness significantly.
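For the declarative case, a minimal sketch (assuming a declarative Base and an already-configured engine) is just:
# Drops every table known to the metadata and recreates them empty;
# only appropriate when the models fully describe the schema.
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)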
For me putting tbl.drop(engine) worked, but not engine.execute(tbl.delete())
(SQLAlchemy 0.8.0b2 and Python 2.7.3)
I'm in the process of moving a MySQL database over to a PostgreSQL database. I have read all of the articles presented here, as well as some of the solutions presented on Stack Overflow, and the tools recommended don't seem to work for me. Both databases were generated by Django's syncdb, although the Postgres DB is more or less empty at the moment.
I tried to migrate the tables over using Django's built-in dumpdata/loaddata functions and its serializers, but it doesn't seem to like a lot of my tables, leading me to believe that writing a manual solution might be best in this case. I have code to verify that the column headers are the same for each table and that the matching tables exist; that works fine.
I was thinking it would be best to just grab the MySQL data row by row and then insert it into the respective Postgres table row by row (I'm not concerned with speed at the moment). The one thing is, I don't know the proper way to construct the insert statement. I have something like:
table_name = retrieve_table()
column_headers = get_headers(table_name) #can return a tuple or a list
postgres_cursor = postgres_con.cursor()
rows = mysql_cursor.fetchall()
for row in rows:  # row is a tuple
    postgres_cursor.execute(????)
Where ???? would be the insert statement; I just don't know the proper way to construct it. I have the table name I would like to insert into as a string, I have the column headers that I can treat as a list, tuple, or string, and I have the respective values that I'd like to insert. What would be the recommended way to construct the statement? I have read psycopg's documentation and didn't quite see anything that satisfies my needs. I don't know (or think) that this is the entirely correct way to migrate, so if someone could steer me in the right direction or offer any advice, I'd really appreciate it.
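One way to build such a statement (a sketch, not the only way; it assumes psycopg2 2.7+ for the psycopg2.sql module and that the table and column names come from your own trusted metadata, while the row values go through ordinary parameter binding):
from psycopg2 import sql

# Compose "INSERT INTO <table> (<cols>) VALUES (%s, %s, ...)" with the
# identifiers quoted by psycopg2 and one placeholder per column.
insert_stmt = sql.SQL('INSERT INTO {table} ({cols}) VALUES ({vals})').format(
    table=sql.Identifier(table_name),
    cols=sql.SQL(', ').join(sql.Identifier(c) for c in column_headers),
    vals=sql.SQL(', ').join(sql.Placeholder() for _ in column_headers),
)

for row in rows:  # row is a tuple of values in column order
    postgres_cursor.execute(insert_stmt, row)
postgres_con.commit()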
Ok my giant friends, once again I seek a little space on your shoulders :P
Here is the issue: I have a Python script that is fixing some database issues, but it is taking way too long. The main update statement is this:
cursor.execute("UPDATE jiveuser SET username = '%s' WHERE userid = %d" % (newName,userId))
That is getting called about 9500 times with different newName and userid pairs...
Any suggestions on how to speed up the process? Maybe somehow a way where I can do all updates with just one query?
Any help will be much appreciated!
PS: Postgres is the db being used.
Insert all the data into another empty table (called userchanges, say) then UPDATE in a single batch:
UPDATE jiveuser
SET username = userchanges.username
FROM userchanges
WHERE userchanges.userid = jiveuser.userid
AND userchanges.username <> jiveuser.username
See this documentation on the COPY command for bulk loading your data.
There are also tips for improving performance when populating a database.
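A rough sketch of that approach with psycopg2 (the changes list of (userid, newName) pairs and the temporary userchanges table are assumptions; COPY is usually the fastest way to get the 9,500 rows staged):
import csv
import io

# Stage the new pairs in memory, bulk-load them with COPY, then apply the
# whole batch with the single UPDATE ... FROM shown above.
buf = io.StringIO()
writer = csv.writer(buf)
for user_id, new_name in changes:  # changes: list of (userid, newName) pairs
    writer.writerow([user_id, new_name])
buf.seek(0)

cursor.execute("CREATE TEMP TABLE userchanges (userid integer, username text)")
cursor.copy_expert("COPY userchanges (userid, username) FROM STDIN WITH CSV", buf)
cursor.execute("""
    UPDATE jiveuser
       SET username = userchanges.username
      FROM userchanges
     WHERE userchanges.userid = jiveuser.userid
       AND userchanges.username <> jiveuser.username
""")
conn.commit()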
First of all, do not use the % operator to construct your SQL. Instead, pass your tuple of arguments as the second parameter to cursor.execute, which also negates the need to quote your argument and allows you to use %s for everything:
cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s", (newName, userId))
This is important to prevent SQL Injection attacks.
To answer your question, you can speed up these updates by creating an index on the userid column, which lets the database find each row with an index lookup (roughly O(log n)) rather than scanning the entire table, which is O(n). Since you're using PostgreSQL, here's the syntax to create your index:
CREATE INDEX username_lookup ON jiveuser (userid);
EDIT: Since your comment reveals that you already have an index on the userid column, there's not much you could possibly do to speed up that query. So your main choices are either living with the slowness, since this sounds like a one-time fix-something-broken thing, or following VeeArr's advice and testing whether cursor.executemany will give you a sufficient boost.
The reason it's taking so long is probably that you've got autocommit enabled and each update gets done in its own transaction.
This is slow because even if you have a battery-backed raid controller (which you should definitely have on all database servers, of course), it still needs to do a write into that device for every transaction commit to ensure durability.
The solution is to do more than one row per transaction. But don't make transactions TOO big or you run into problems too. Try committing every 10,000 rows of changes as a rough guess.
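A minimal sketch of that batching with psycopg2 (the changes list and the 10,000-row batch size are assumptions; the right threshold is worth measuring):
conn.autocommit = False  # make sure each execute() does not commit on its own
for i, (new_name, user_id) in enumerate(changes, start=1):
    cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s",
                   (new_name, user_id))
    if i % 10000 == 0:  # commit in batches rather than once per row
        conn.commit()
conn.commit()  # flush whatever is left in the final partial batch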
You might want to look into cursor.executemany().
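Roughly like this (a sketch; whether executemany() actually beats a plain loop of execute() calls depends on the driver, so it is worth benchmarking both):
# One call through the driver for the whole batch instead of one per row.
cursor.executemany(
    "UPDATE jiveuser SET username = %s WHERE userid = %s",
    [(new_name, user_id) for new_name, user_id in changes],  # hypothetical list of pairs
)
conn.commit()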
Perhaps you can create an index on userid to speed things up.
I'd do an explain on this. If it's doing an indexed lookup to find the record -- which it should if you have an index on userid -- then I don't see what you could do to improve performance. If it's not using the index, then the trick is figuring out why not and fixing it.
Oh, you could try using a prepared statement. With 9,500 updates, that should help.
Move this to a stored procedure and execute it from within the database itself.
First, ensure you have an index on userid; this will ensure the DBMS doesn't have to do a table scan each time:
CREATE INDEX jiveuser_userid ON jiveuser (userid);
Next, try preparing the statement and then calling EXECUTE on it. This saves the planner from having to parse and plan the query each time:
PREPARE update_username(text, integer) AS UPDATE jiveuser SET username = $1 WHERE userid = $2;
EXECUTE update_username('New Name', 123);
Finally, a bit more performance could be squeezed out by turning off autocommit
\set AUTOCOMMIT off
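From Python, the same idea can be driven through psycopg2 by issuing the PREPARE once per connection and then an EXECUTE per pair (a sketch; the changes list of (newName, userId) pairs is an assumption):
# Prepare once, then reuse the stored plan for every update.
cursor.execute("PREPARE update_username(text, integer) AS "
               "UPDATE jiveuser SET username = $1 WHERE userid = $2")
for new_name, user_id in changes:
    cursor.execute("EXECUTE update_username(%s, %s)", (new_name, user_id))
conn.commit()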