I have a Postgres 9.3 table whose primary key is a column called id of type char(9), allowing only lowercase a-z and 0-9. I use Python with psycopg to insert into this table.
When I need to insert into this table, I call a Python function get_new_id(). My question is: how can I make get_new_id() efficient?
I have the following solutions, but none of them satisfies me.
a) Pre-generate a lot of ids and store them in a staging table. When I need a new id, I SELECT one from this table, delete it from the table, and return it. The downside of this solution is that the staging table has to be maintained, and every get_new_id() call also needs a SELECT COUNT to find out whether more ids have to be generated and put into the table.
b) When get_new_id() is called, it generates a random id and passes it to a stored procedure that checks whether the id is already in use. If not, we are good; if it is, repeat b). The downside of this solution is that as the table grows the collision rate may get high, and there is also a race: two get_new_id() calls in two processes can generate the same id, say 1234567, which is not yet used as a primary key, so at insert time one of the processes will fail.
I think this is a pretty old problem, what's the perfect solution?
Edit
I think this has been answered, see Jon Clements' comment.
Off-topic, because you already have a char(9) datatype:
I would use a UUID when a random string is needed; it's a standard, and almost any programming language (including Python) can generate UUIDs for you.
PostgreSQL can also do it for you, using the uuid-ossp extension.
select left(md5(random()::text || now()), 9);
left
-----------
c4c384561
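If you prefer to generate the string on the Python side instead, here is a minimal sketch that draws from exactly the a-z0-9 alphabet the column allows; generate_id is just an illustrative name, not something provided by psycopg:
import random
import string

ALPHABET = string.ascii_lowercase + string.digits  # a-z0-9, as the column allows

def generate_id(length=9):
    # Uses the OS random source; uniqueness is still not guaranteed,
    # so the insert must handle collisions.
    rng = random.SystemRandom()
    return ''.join(rng.choice(ALPHABET) for _ in range(length))

print(generate_id())  # e.g. 'c4c384561'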
Make the id the primary key and just try the insert. If an exception is thrown, catch it and retry. Nothing fancy about it. Why only 9 characters? Make it the full 32.
Check this answer for how to make it smaller: https://stackoverflow.com/a/15982876/131874
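A rough psycopg2 sketch of that insert-and-retry loop; the table name mytable, the payload column and the generate_id() helper from the earlier sketch are all illustrative:
import psycopg2
from psycopg2 import errorcodes

def insert_with_new_id(conn, payload, max_attempts=5):
    # Try random ids until an insert succeeds or we give up.
    for _ in range(max_attempts):
        new_id = generate_id()  # hypothetical helper, see the sketch above
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO mytable (id, payload) VALUES (%s, %s)",
                    (new_id, payload),
                )
            conn.commit()
            return new_id
        except psycopg2.IntegrityError as exc:
            conn.rollback()
            if exc.pgcode != errorcodes.UNIQUE_VIOLATION:
                raise  # some other constraint failed; don't retry
    raise RuntimeError("no free id found after %d attempts" % max_attempts)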
Related
This question already has answers here:
How do you escape strings for SQLite table/column names in Python?
I have a wide table in a sqlite3 database, and I wish to dynamically query certain columns in a Python script. I know that it's bad to inject parameters by string concatenation, so I tried to use parameter substitution instead.
I find that, when I use parameter substitution to supply a column name, I get unexpected results. A minimal example:
import sqlite3 as lite
db = lite.connect("mre.sqlite")
c = db.cursor()
# Insert some dummy rows
c.execute("CREATE TABLE trouble (value real)")
c.execute("INSERT INTO trouble (value) VALUES (2)")
c.execute("INSERT INTO trouble (value) VALUES (4)")
db.commit()
for row in c.execute("SELECT AVG(value) FROM trouble"):
    print row  # Returns 3
for row in c.execute("SELECT AVG(:name) FROM trouble", {"name": "value"}):
    print row  # Returns 0
db.close()
Is there a better way to accomplish this than simply injecting a column name into a string and running it?
As Rob just indicated in his comment, there was a related SO post that contains my answer. These substitution constructions are called "placeholders," which is why I did not find the answer on SO initially. There is no placeholder pattern for column names, because dynamically specifying columns is not a code safety issue:
It comes down to what "safe" means. The conventional wisdom is that
using normal python string manipulation to put values into your
queries is not "safe". This is because there are all sorts of things
that can go wrong if you do that, and such data very often comes from
the user and is not in your control. You need a 100% reliable way of
escaping these values properly so that a user cannot inject SQL in a
data value and have the database execute it. So the library writers do
this job; you never should.
If, however, you're writing generic helper code to operate on things
in databases, then these considerations don't apply as much. You are
implicitly giving anyone who can call such code access to everything
in the database; that's the point of the helper code. So now the
safety concern is making sure that user-generated data can never be
used in such code. This is a general security issue in coding, and is
just the same problem as blindly execing a user-input string. It's a
distinct issue from inserting values into your queries, because there
you want to be able to safely handle user-input data.
So, the solution is that there is no problem in the first place: inject the column names using string formatting, be happy, and move on with your life.
Why not use string formatting?
for row in c.execute("SELECT AVG({name}) FROM trouble".format(name="value")):
    print row  # => (3.0,)
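If the column name can come from outside the program, a common precaution (not something the sqlite3 module does for you) is to check it against a whitelist of known columns before formatting it in. A minimal sketch, reusing the cursor c from the example above (before db.close()); avg_of and ALLOWED_COLUMNS are illustrative names:
ALLOWED_COLUMNS = {"value"}  # the columns we are willing to aggregate over

def avg_of(cursor, column):
    # Validate the column name first, then format it into the query.
    if column not in ALLOWED_COLUMNS:
        raise ValueError("unknown column: %r" % column)
    return cursor.execute(
        "SELECT AVG({name}) FROM trouble".format(name=column)).fetchone()[0]

print(avg_of(c, "value"))  # => 3.0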
Hello StackEx community.
I am implementing a relational database using SQLite interfaced with Python. My table consists of 5 attributes with around a million tuples.
To avoid a large number of database queries, I wish to execute a single query that updates 2 attributes of multiple tuples. The updated values depend on each tuple's primary key value and so are different for each tuple.
I am trying something like the following in Python 2.7:
stmt= 'UPDATE Users SET Userid (?,?), Neighbours (?,?) WHERE Username IN (?,?)'
cursor.execute(stmt, [(_id1, _Ngbr1, _name1), (_id2, _Ngbr2, _name2)])
In other words, I am trying to update the rows whose primary keys are _name1 and _name2, setting the Neighbours and Userid columns to the corresponding values. Executing these two statements returns the following error:
OperationalError: near "(": syntax error
I am reluctant to use executemany() because I want to reduce the number of trips across the database.
I have been struggling with this issue for a couple of hours now, but I couldn't figure out either the error or an alternative on the web. Please help.
Thanks in advance.
If the column used to look up the rows to update is properly indexed, then executing multiple UPDATE statements is likely to be more efficient than a single statement, because with one combined statement the database would probably need to scan all rows.
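As an aside, SQLite runs in-process, so there are no network round trips to save, and executemany() is a single Python call even though it executes one UPDATE per parameter tuple. A minimal sketch using the table and columns from the question (the file name and values are illustrative):
import sqlite3

db = sqlite3.connect("users.sqlite")  # hypothetical database file
cursor = db.cursor()

# One parameter tuple per row to update: (new Userid, new Neighbours, Username)
rows = [
    (101, "ngbr-a", "alice"),
    (102, "ngbr-b", "bob"),
]
cursor.executemany(
    "UPDATE Users SET Userid = ?, Neighbours = ? WHERE Username = ?",
    rows,
)
db.commit()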
Anyway, if you really want to do this, you can use CASE expressions (and explicitly numbered parameters, to avoid duplicates):
UPDATE Users
SET Userid = CASE Username
                 WHEN ?5 THEN ?1
                 WHEN ?6 THEN ?2
             END,
    Neighbours = CASE Username
                     WHEN ?5 THEN ?3
                     WHEN ?6 THEN ?4
                 END
WHERE Username IN (?5, ?6);
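If you go the single-statement route, here is a sketch of running it from Python's sqlite3 module; the ?N placeholders are filled from one positional parameter sequence (?1 is the first element), and the file name and values are illustrative:
import sqlite3

db = sqlite3.connect("users.sqlite")  # hypothetical database file
cur = db.cursor()

# ?1..?4 are the new Userid/Neighbours values, ?5 and ?6 the target usernames.
params = (101, 102, "ngbr-a", "ngbr-b", "alice", "bob")
cur.execute(
    """
    UPDATE Users
    SET Userid = CASE Username WHEN ?5 THEN ?1 WHEN ?6 THEN ?2 END,
        Neighbours = CASE Username WHEN ?5 THEN ?3 WHEN ?6 THEN ?4 END
    WHERE Username IN (?5, ?6)
    """,
    params,
)
db.commit()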
I wrote this Python script to import a specific xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use a MySQL JOIN, but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
mailing_list = {}
rows = []

for row in range(sheet.nrows):
    # name is in the 0th col. email is the 4th col.
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany(
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)", rows)
db.commit()
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: a UNIQUE constraint on the column, with INSERT IGNORE to fail silently or ON DUPLICATE KEY UPDATE to update the existing row instead of inserting a duplicate.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it's not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column/columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given that you are using executemany, the outcome may not be what you would expect.
Then as strategies you can do:
An initial filtering step, by running a SELECT statement beforehand. This means running one SELECT statement per item in rows to check whether it already exists. This strategy works, but it is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information refer to the official documentation.
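A rough sketch of that second strategy with MySQLdb, reusing the connection parameters and the (name, email) rows from the question, and assuming a UNIQUE (or primary key) constraint on mailing_list.name; the update clause here simply rewrites the email for an existing name:
import MySQLdb

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()

# Requires a UNIQUE (or PRIMARY KEY) constraint on mailing_list.name.
cursor.executemany(
    """
    INSERT INTO mailing_list (name, email)
    VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE email = VALUES(email)
    """,
    rows,
)
db.commit()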
2 - Is executemany() as expensive as INSERT in a loop?
No. executemany creates one query that inserts in bulk, while a for loop will create as many queries as there are elements in rows.
I have two Django-ORM managed databases that I'd like to merge. Both have a very similar schema, and both have the standard auth_users table, along with a few other shared tables that reference each other as well as auth_users, which I'd like to merge into a single database automatically.
Understandably, this could be very non-trivial depending upon the foreign-key relationships, and what constitutes a "unique" record in each table.
Does anyone know if there exists a tool to do this merge operation?
If nothing like this currently exists, I was considering writing my own management command, based on the standard loaddata command. Essentially, you'd use the standard dumpdata command to export tables from a source database, and then use a modified version of loaddata to "merge" them into the destination database.
For example, if I have databases A and B, and I want to merge database B into database A, then I'd want to follow a procedure according to the pseudo-code:
merge_database_dst = A
merge_database_src = B
for table in sorted(merge_database_dst.get_redundant_tables(merge_database_src),
                    key=acyclic_dependency):
    key = table.get_unique_column_key()
    src_id_to_dst_id = {}
    for record_src in merge_database_src.table.objects.all():
        src_key_value = record_src.get_key_value(key)
        try:
            record_dst = merge_database_dst.table.objects.get(**{key: src_key_value})
            dst_key_value = record_dst.get_key_value(key)
        except merge_database_dst.table.DoesNotExist:
            record_dst = merge_database_dst.table(
                **dict((k, convert_fk(v)) for k, v in record_src._meta.fields))
            record_dst.save()
            dst_key_value = record_dst.get_key_value(key)
        src_id_to_dst_id[(table, record_src.id)] = record_dst.id
The convert_fk() function would use the src_id_to_dst_id index to convert foreign key references in the source table to the equivalent IDs in the destination table.
To summarize, the algorithm would iterate over the table to be merged in the order of dependency, with parents iterated over first. So if we wanted to merge tables auth_users and mycustomprofile, which is dependent on auth_users, we'd iterate ['auth_users','mycustomprofile'].
Each merged table would need some sort of indicator documenting the combination of columns that denotes a universally unique record (i.e. the "key"). For auth_users, that might be the "username" and/or "email" column.
If the value of the key in database B already exists in A, then the record is not imported from B, but the ID of the existing record in A is recorded.
If the value of the key in database B does not exist in A, then the record is imported from B, and the ID of the new record is recorded.
Using the previously recorded ID, a mapping is created, explaining how to map foreign-key references to that specific record in B to the new merged/pre-existing record in A. When future records are merged into A, this mapping would be used to convert the foreign keys.
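For illustration, the convert_fk() helper from the pseudo-code might look roughly like this; here it takes the table and the mapping explicitly, whereas the pseudo-code calls it with just the value, so this is an assumption about the intended design rather than an existing Django utility:
def convert_fk(table, src_fk_id, src_id_to_dst_id):
    # Translate a foreign-key id from the source database (B) into the id
    # of the merged or pre-existing record in the destination database (A).
    try:
        return src_id_to_dst_id[(table, src_fk_id)]
    except KeyError:
        # The referenced record was not merged yet (or was missing from the
        # dump); this is the situation a "dryrun" pass would report.
        raise ValueError("no mapping for %s id %r" % (table, src_fk_id))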
I could still envision some cases where an imported record references a table not included in the dumpdata, which might cause the entire import to fail, therefore some sort of "dryrun" option would be needed to simulate the import to ensure all FK references can be translated.
Does this seem like a practical approach? Is there a better way?
EDIT: This isn't exactly what I'm looking for, but I thought others might find it interesting. The Turbion project has a mechanism for copying changes between equivalent records in different Django models within the same database. It works by defining a translation layer (i.e. merging.ModelLayer) between two Django models, so if, say, you update the "www" field in user bob@bob.com's profile, it'll automatically update the "url" field in user bob@bob.com's otherprofile.
The functionality I'm looking for is a bit different, in that I want to merge an entire (or partial) database snapshot at infrequent intervals, sort of the way the loaddata management command does.
Wow. This is going to be a complex job regardless. That said:
If I understand the needs of your project correctly, this can be something that can be done using a data migration in South. Even so, I'd be lying if I said it was going to be a joke.
My recommendation is -- and this is mostly a parrot of an assumption in your question, but I want to make it clear -- that you have one "master" table that is the base, and which has records from the other table added to it. So, table A keeps all of its existing records, and only gets additions from B. B feeds additions into A, and once done, B is deleted.
I'm hesitant to write you sample code because your actual job will be so much more complex than this, but I will anyway to try and point you in the right direction. Consider something like...
import datetime
from south.db import db
from south.v2 import DataMigration
from django.db import models


class Migration(DataMigration):

    def forwards(self, orm):
        for b in orm.B.objects.all():
            # sanity check: does this item get copied into A at all?
            if orm.A.objects.filter(username=b.username):
                continue

            # make an A record with the properties of my B record
            a = orm.A(
                first_name=b.first_name,
                last_name=b.last_name,
                email_address=b.email_address,
                [...]
            )

            # save the new A record, and delete the B record
            a.save()
            b.delete()

    def backwards(self, orm):
        # backwards method, if you write one
        pass
This would end up migrating all of the Bs not in A to A, and leave you a table of Bs that are expected duplicates, which you could then check by some other means before deleting.
Like I said, this sample isn't meant to be complete. If you decide to go this route, spend time in the South documentation, and particularly make sure you look at data migrations.
That's my 2¢. Hope it helps.
I have a database (sqlite) of members of an organisation (fewer than 200 people). Now I'm trying to write a wxPython app that will search the database and return some contact information in a wx.grid. The app will have 2 TextCtrls, one for the first name and one for the last name. What I want to do here is make it possible to write only one or a few letters in the TextCtrls and start returning results. So, if I search for "John Smith", I write "Jo" in the first TextCtrl and that will return every John (or anyone else whose name starts with those letters). There will be no "search" button; instead it will start searching whenever I press a key.
One way to solve this would be to query the database with something like SELECT * FROM contactlistview WHERE forname LIKE 'Jo%'. But that seems like a bad idea (very database-heavy to do that for every keystroke?). Instead, I thought of using fetchall() on a query like SELECT * FROM contactlistview and then, for every keystroke, searching the list of tuples that the query returned. And that is my problem: searching a list is not that difficult, but how can I search a list of tuples with wildcards?
selected = [t for t in all_data if t[1].startswith('Jo')]
But measure, don't guess. I think that in some cases the query would be faster, especially if you have a lot of records. Maybe you should run a query on the first character, and then switch to Python-side filtering, since you already have the results.
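A rough sketch of that hybrid approach, reusing the contactlistview table and forname column from the question; the file name, the column index, and the case-insensitive comparison are assumptions:
import sqlite3

db = sqlite3.connect("members.sqlite")  # hypothetical database file
cur = db.cursor()

cached_prefix = None
cached_rows = []

def search(text):
    # Hit the database once per leading character, then filter the cached
    # rows in Python as the user keeps typing.
    global cached_prefix, cached_rows
    if not text:
        return []
    if text[0] != cached_prefix:
        cur.execute("SELECT * FROM contactlistview WHERE forname LIKE ?",
                    (text[0] + '%',))
        cached_rows = cur.fetchall()
        cached_prefix = text[0]
    # forname is assumed to be the second column (index 1), as in the
    # list comprehension above; compare case-insensitively, like LIKE does.
    return [row for row in cached_rows
            if row[1].lower().startswith(text.lower())]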
I think that generally, you shouldn't be afraid of giving tasks to a database. It's quite possible that the LIKE clause will be very fast. Sqlite is implemented in fairly robust C code, and will happily deal with queries like this.
If you're worried about sending too many requests, why not send a query once a user has entered a threshold of characters, such as three?
A list comprehension is probably the best way to return the result if you want to do added filtering.
If you are searching for a string matching at the start using LIKE, e.g. 'abc%' (rather than anywhere in the string, '%abc%'), the search should be quite fast if you have an index on the field, as the db can use the index to help find the matches.
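A minimal sketch of that, assuming contactlistview is a real table (if it is a view, the index goes on the underlying table's column); since LIKE is case-insensitive by default, the index uses NOCASE collation so SQLite can use it for prefix patterns:
import sqlite3

db = sqlite3.connect("members.sqlite")  # hypothetical database file
cur = db.cursor()

# LIKE is case-insensitive by default, so index the column with NOCASE
# collation to let SQLite use the index for prefix patterns such as 'jo%'.
cur.execute("CREATE INDEX IF NOT EXISTS idx_forname "
            "ON contactlistview (forname COLLATE NOCASE)")

def lookup(prefix):
    # Prefix match only ('jo%'), not a substring match ('%jo%').
    cur.execute("SELECT * FROM contactlistview WHERE forname LIKE ?",
                (prefix + '%',))
    return cur.fetchall()

print(lookup("jo"))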