PostgreSQL Update Records in Batches - python

I need to update one column in a table holding a huge set of data in a PostgreSQL DB.
Since the job might run continuously for 1 or 2 days due to the amount of data, I need to do this batch-wise and commit each batch's transaction separately, so that I can keep track of progress, log any batches that fail, and rerun them manually later by providing the failed offset and limit.
One method I tried is the following PL/pgSQL block, which failed because row_number() can't be used in a WHERE clause.
DO LANGUAGE plpgsql $$
DECLARE
    row_count_ integer;
    offset_ integer := 0;
    batch_size_ integer := 100000;
    limit_ integer := offset_ + batch_size_;
    total_rows_ integer;
BEGIN
    WHILE offset_ < total_rows_ LOOP
        limit_ := offset_ + batch_size_;
        UPDATE table1
        SET column1 = 'Value'
        WHERE row_number() over() >= offset_ AND row_number() over() < limit_;
        GET DIAGNOSTICS row_count_ = row_count;
        RAISE INFO '% rows updated from % to %', row_count_, offset_, limit_;
        offset_ := offset_ + batch_size_;
    END LOOP;
EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 'Transaction is rolling back, % : %', SQLSTATE, SQLERRM;
    ROLLBACK;
END $$;
I'm also OK with doing this using a Python script, but I need the fastest way possible. I went through many articles that use a SELECT subquery, which in my opinion is too expensive because of the join.
Could someone please help me with a better way to achieve this?

If the activity lasts for several days, doing the UPDATE in batches makes sense. You may want to run an explicit VACUUM on the table between batches to avoid table bloat.
About your core problem, I would say that the simplest solution would be to batch by primary key values, that is, run statements like:
UPDATE tab
SET col = newval
WHERE id <= 100000
AND /* additional criteria*/;
VACUUM tab;
UPDATE tab
SET col = newval
WHERE id > 100000 AND id <= 200000
AND /* additional criteria*/;
...
Keep repeating that until you reach the maximum id.
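In Python, a minimal sketch of the same idea might look like this, assuming the psycopg2 driver and an integer primary key named id; table, column, and connection details are placeholders:

import logging

import psycopg2

logging.basicConfig(level=logging.INFO)

BATCH_SIZE = 100000
conn = psycopg2.connect("dbname=mydb")  # placeholder connection string

# find the upper bound once, before batching
with conn.cursor() as cur:
    cur.execute("SELECT coalesce(max(id), 0) FROM tab")
    max_id = cur.fetchone()[0]
conn.commit()

start = 0
while start < max_id:
    end = start + BATCH_SIZE
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE tab SET col = %s WHERE id > %s AND id <= %s",
                ("newval", start, end),
            )
        conn.commit()  # one transaction per batch
        logging.info("updated ids in (%s, %s]", start, end)
    except Exception:
        conn.rollback()
        logging.exception("batch (%s, %s] failed; rerun it manually", start, end)
    start = end

conn.close()

If you also want the VACUUM between batches from Python, remember that VACUUM cannot run inside a transaction block, so the connection needs to be in autocommit mode for that statement.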

Related

How can I avoid this SQL issue?

When I run my query I get this error: pyodbc.ProgrammingError: ('42000', '[42000] [IBM][System i Access ODBC Driver][DB2 for i5/OS]SQL0010 - String constant beginning " " not delimited. (-10) (SQLExecDirectW)')
Can someone help me please!
cursor = db2_conn.cursor()
cursor.execute("select * from qs36f.DSTHSTP join qs36f.calendar on date_ccyymmd = dhindt join qs36f.itemp i on DHITM#=i.ZIITM# join qs36f.subcl2p on i.zisbcl = s2sbcl and i.ziclss = s2clss join qs36f.subclsp on sbclss=s2clss and SBSBCL=s2sbcl where date_iso between(current date - 10 day) and current date and DHCUS# in (" + open_stores + ") and dhqtss >= 1 and SBDEPT='MT' AND dhclss = " + class_nbr + " and ((dhqtss*dhrt5s)*DHPACK) > " + start_range + "")
I tried to run this query to get some data from the DB for testing. With this query I should get the store number, item number, and prices greater than 250.
This is a classic SQL injection scenario, caused by concatenating unchecked input to build queries. String concatenation or interpolation is the root cause of this problem and can't be fixed with any amount of quoting or escaping.
The real and easy solution is to use parameterized queries. Another significant improvement is building the query string separately from the call. Putting everything on a single line is neither cleaner, faster, nor easier.
sql="""select *
from qs36f.DSTHSTP
join qs36f.calendar on date_ccyymmd = dhindt
join qs36f.itemp i on DHITM#=i.ZIITM#
join qs36f.subcl2p on i.zisbcl = s2sbcl and i.ziclss = s2clss
join qs36f.subclsp on sbclss=s2clss and SBSBCL=s2sbcl
where date_iso between(current date - 10 day) and current date
and DHCUS# in (?,?,?)
and dhqtss >= 1
and SBDEPT='MT'
AND dhclss = ?
and ((dhqtss*dhrt5s)*DHPACK) > ?
"""
rows=cursor.execute(sql,shop1,shop2,shop3,class_nbr,start_range)
? specifies query parameters by position. Parameter values never become part of the query. They are sent to the server as strongly typed values (integers, dates, floats) outside the query itself. The database compiles the query into a parameterized execution plan and then executes that plan with the parameter values.
This has several benefits:
Obviously, SQL injections are eliminated as the values never become part of the string.
Formatting errors are eliminated because numbers and dates are passed as binary values. You no longer have to worry about 34,56, MM/DD/YYYY or DD/MM/YYYY. You can compare date fields directly against date values, numeric fields with numbers.
The database can reuse already compiled execution plans for the same query string. That offers significant benefits for big, busy systems.
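One practical wrinkle: the example above hardcodes three ? placeholders for three stores. If the number of stores varies at runtime, you can generate the placeholders while still binding the values as parameters. A sketch, with the joins abbreviated and open_stores assumed to be a list of store numbers:

open_stores = ["101", "102", "103"]  # example values; use your real list
placeholders = ", ".join("?" for _ in open_stores)
sql = (
    "select * from qs36f.DSTHSTP "
    "where DHCUS# in (%s) "
    "and dhclss = ? "
    "and ((dhqtss*dhrt5s)*DHPACK) > ?" % placeholders
)
params = list(open_stores) + [class_nbr, start_range]
rows = cursor.execute(sql, params)

Only the ? characters are interpolated into the SQL string; the values themselves still travel as parameters.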

SQLite3 How to Select first 100 rows from database, then the next 100

Currently I have a database filled with thousands of rows.
I want to SELECT the first 100 rows, and then select the next 100, then the next 100 and so on...
So far I have:
c.execute('SELECT words FROM testWords')
data = c.fetchmany(100)
This allows me to get the first 100 rows, however, I can't find the syntax for selecting the next 100 rows after that, using another SELECT statement.
I've seen it is possible with other coding languages, but haven't found a solution with Python's SQLite3.
When you are using cursor.fetchmany() you don't have to issue another SELECT statement. The cursor is keeping track of where you are in the series of results, and all you need to do is call c.fetchmany(100) again until that produces an empty result:
c.execute('SELECT words FROM testWords')
while True:
    batch = c.fetchmany(100)
    if not batch:
        break
    # each batch contains up to 100 rows; process it here
or using the iter() function (which can be used to repeatedly call a function until a sentinel result is reached):
c.execute('SELECT words FROM testWords')
for batch in iter(lambda: c.fetchmany(100), []):
    pass  # each batch contains up to 100 rows; process it here
If you can't keep hold of the cursor (say, because you are serving web requests), then using cursor.fetchmany() is the wrong interface. You'll instead have to tell the SELECT statement to return only a selected window of rows, using the LIMIT syntax. LIMIT has an optional OFFSET keyword, together these two keywords specify at what row to start and how many rows to return.
Note that you want to make sure that your SELECT statement is ordered so you get a stable result set you can then slice into batches.
batchsize = 1000
offset = 0
while True:
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    batch = list(c)
    offset += batchsize
    if not batch:
        break
Pass the offset value to the next call of your code if you need to send these batches elsewhere and then resume later on.
sqlite3 has nothing to do with Python. It is a standalone database; Python just supplies an interface to it.
Like any normal database, sqlite supports standard SQL. In SQL, you can use LIMIT and OFFSET to determine the start and end of your query. Note that if you do this, you should really use an explicit ORDER BY clause to ensure that your results are consistently ordered between queries.
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100')
...
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100 OFFSET 100')
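As a side note, OFFSET gets slower as it grows, because the database still has to walk past all the skipped rows. If the table has an integer primary key, a keyset-style variant avoids that; a sketch (the id column is an assumption):

last_id = 0
while True:
    c.execute(
        'SELECT id, words FROM testWords WHERE id > ? ORDER BY id LIMIT 100',
        (last_id,))
    batch = c.fetchall()
    if not batch:
        break
    last_id = batch[-1][0]  # remember where this batch ended
    # process the batch here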
You can create an iterator and call it multiple times:
def ResultIter(cursor, arraysize=100):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
Or simply like this for returning the first 5 rows:
num_rows = 5
cursor = dbconn.execute("SELECT words FROM testWords")
for row in cursor.fetchmany(num_rows):
    print("Words= " + str(row[0]) + "\n")

Get random record set with Django, what is affecting the performance?

It is said that
Record.objects.order_by('?')[:n]
has performance issues, and doing something like this is recommended instead: (here)
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
Given that, why not do it directly like this:
result = random.sample(Record.objects.all(),n)
I have no idea what Django is actually doing in the background when this code runs. Please tell me whether the one-line version at the end is more efficient or not, and why.
================ Edit 2013-5-12 23:21 UTC+8 ================
I spent my whole afternoon on this test.
My computer : CPU Intel i5-3210M RAM 8G
System : Win8.1 pro x64 Wampserver2.4-x64 (with apache2.4.4 mysql5.6.12 php5.4.12) Python2.7.5 Django1.4.6
What I did was:
1. Create an app.
2. Build a simple model with an index and a CharField content, then syncdb.
3. Create 3 views that get a random set of 20 records in the 3 different ways above, and output the time used.
4. Modify settings.py so that Django outputs logs to the console.
5. Insert rows into the table, until the number of rows is what I want.
6. Visit the 3 views, and note the SQL query statement, SQL time, and total time.
7. Repeat steps 5 and 6 with different numbers of rows in the table (10k, 200k, 1m, 5m).
This is views.py:
import datetime
import random

from django.http import HttpResponse

from models import Record  # the app's model

def test1(request):
    start = datetime.datetime.now()
    result = Record.objects.order_by('?')[:20]
    l = list(result)  # QuerySets are lazy; force evaluation by converting to a list
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % ((end - start).microseconds / 1000))

def test2(request):
    start = datetime.datetime.now()
    sample = random.sample(xrange(Record.objects.count()), 20)
    result = [Record.objects.all()[i] for i in sample]
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))

def test3(request):
    start = datetime.datetime.now()
    result = random.sample(Record.objects.all(), 20)
    l = list(result)
    end = datetime.datetime.now()
    return HttpResponse("time: <br/> %s" % (end - start))
As @Yeo said, result = random.sample(Record.objects.all(),n) is crap. I won't talk about that.
But interestingly, Record.objects.order_by('?')[:n] always performs better than the others, especially when the table is smaller than 1m rows. Here is the data:
and the charts:
So, what happened?
In the last test, with 5,195,536 rows in the target table, result = random.sample(Record.objects.all(),n) actually did this:
(22.275) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` ORDER BY RAND() LIMIT 20; args=()
Everyone is right. And it took 22 seconds. And
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
actually did this:
(1.393) SELECT COUNT(*) FROM `randomrecords_record`; args=()
(3.201) SELECT `randomrecords_record`.`id`, `randomrecords_record`.`content`
FROM `randomrecords_record` LIMIT 1 OFFSET 4997880; args=()
...20 lines
As you can see, fetching just one row took 3 seconds, and I found that the larger the offset, the more time is needed.
But... why?
My thinking is: if there were some way to speed up the large-offset query, then
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
should be the best approach. Except(!) when the table is smaller than 1m rows.
The problem with .order_by('?') is that under the hood it does ORDER BY RAND() (or equivalent, depending on the DB), which basically has to generate a random number for each row and then do the sorting. This is a heavy operation and requires lots of time.
On the other hand, doing Record.objects.all() forces your app to download all objects, and then you choose from them. It is not that heavy on the database side (it will be faster than sorting) but it is heavy on the network and memory. Thus it can kill your performance as well.
So that's the tradeoff.
Now this is a lot better:
sample = random.sample(xrange(Record.objects.count()),n)
result = [Record.objects.all()[i] for i in sample]
simply because it avoids all the problems mentioned above (note that Record.objects.all()[i] gets translated to SELECT * FROM table LIMIT 1 OFFSET i, depending on the DB).
However, it may still be inefficient, since .count() might be slow (as usual: it depends on the DB).
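A variant worth considering replaces the n separate OFFSET queries with one light id-only query plus a single IN query. A sketch, assuming the id list fits comfortably in memory:

import random

ids = list(Record.objects.values_list('id', flat=True))
chosen = random.sample(ids, 20)
result = list(Record.objects.filter(id__in=chosen))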
Record.objects.count() gets translated into a very light SQL query:
SELECT COUNT(*) FROM TABLE
Record.objects.all()[0] is also translated into a very light SQL query:
SELECT * FROM TABLE LIMIT 1
With Record.objects.all(), the results usually get sliced to increase performance:
SELECT * FROM table LIMIT 20; // or something similar
list(Record.objects.all()) will query all the data and put it into a list data structure:
SELECT * FROM TABLE
Thus, any time you convert a QuerySet into a list, that's where the expensive part happens.
In your example, random.sample() will convert its argument into a list (if I'm not wrong).
Thus, when you do result = random.sample(Record.objects.all(),n), it will evaluate the full QuerySet, convert it into a list, and then randomly pick from that list.
Just imagine if you have millions of records: are you going to query and store them all in a list with millions of elements, or would you rather query them one by one?

Any recommendations to improve this function?

I am very new to working with SQL queries. Any suggestions to improve this bit of code?
(By the way, I really don't care about SQL security here; this is a bit of code that will be in a pyexe file connecting to a local sqlite file, so it doesn't make sense to worry about the security of the query.)
def InitBars(QA = "GDP1POP1_20091224_gdp", QB = "1 pork", reset = False):
    global heights, values
    D, heights, values, max, = [], {}, {}, 0.0001
    if reset: GHolder.remove()
    Q = "SELECT wbcode, Year, "+QA+" FROM DB WHERE commodity='"+QB+"' and "+QA+" IS NOT 'NULL'"
    for i in cursor.execute(Q):
        D.append((str(i[0]) + str(i[1]), float(i[2])))
        if float(i[2]) > max: max = float(i[2])
    for (i, n) in D: heights[i] = 5.0 / max * n; values[i] = n
    Gui["YRBox_Slider"].set(0.0)
    Gui["YRBox_Speed"].set(0.0)
After following the advice, this is what I got:
def InitBars(QA = "GDP1POP1_20091224_gdp", QB = "1 pork", reset = False):
    global heights, values; D, heights, values, max, = [], {}, {}, 0.0001
    if reset: GHolder.remove()
    Q = "SELECT wbcode||Year, %s FROM DB WHERE commodity='%s' and %s IS NOT 'NULL'" % (QA, QB, QA)
    for a, b in cursor.execute(Q):
        if float(b) > max: max = float(b)
        values[a] = float(b)
    for i in values: heights[i] = 5.0 / max * values[i]
    Gui["YRBox_Slider"].set(0.0); Gui["YRBox_Speed"].set(0.0)
If this is a one-off script where you totally trust all of the input data and you just need to get a job done, then fine.
If this is part of a system, and this is indicative of the kind of code in it, there are several problems:
Don't construct SQL queries by appending strings. You said that you don't care about security, but this is such a big problem and so easily solved that really, you should do it right all of the time.
This function uses and manipulates global state. Again, if this is a small one-time-use script, then go for it; but in systems that span more than a few files, this becomes impossible to maintain.
Naming conventions: there's no consistency in capitalization.
The names of things are not helpful at all. QA, D, QB -- QA and QB don't even seem to be the same kind of thing: one is a field, the other is a value.
All kinds of questionable things are uncommented: why is max 0.0001? What the heck is GHolder? What could that loop at the end be doing? Really, the code should be clearer, but if not, throw the maintainer a bone.
Use more descriptive variable names than QA and QB.
Comment the code.
Don't put multiple statements on the same line.
Try not to use globals. Use member variables instead.
If QA and QB may come from user input, don't use them to build SQL queries.
You should check for SQL injection. Make sure that there's no SQL statement in QA. Also you should probably add slashes if it applies.
Use
Q = "SELECT wbcode, Year, %s FROM DB WHERE commodity='%s' and %s IS NOT 'NULL'" % (QA, QB, QA)
instead:
Q = "SELECT wbcode, Year, "+QA+" FROM DB WHERE commodity='"+QB+"' and "+QA+" IS NOT 'NULL'"
Care about security (sql injection).
Look at any ORM (SQLAlchemy, for example). It makes things easy :)
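For completeness, here is a sketch of how the query could be parameterized. Column names cannot be bound as parameters, so QA is checked against a whitelist (ALLOWED_COLUMNS is a placeholder to fill with the real column names), while the value QB is bound normally; the comparison against the literal string 'NULL' mirrors the original query:

ALLOWED_COLUMNS = {"GDP1POP1_20091224_gdp"}  # fill in real column names

def build_query(QA):
    if QA not in ALLOWED_COLUMNS:
        raise ValueError("unexpected column name: %r" % (QA,))
    return ("SELECT wbcode||Year, %s FROM DB "
            "WHERE commodity = ? AND %s IS NOT 'NULL'" % (QA, QA))

for a, b in cursor.execute(build_query(QA), (QB,)):
    pass  # same processing as before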

What's the best way to optimize this MySQL query?

This is a query that totals up every players game results from a game and displays the players who match the conditions.
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses
from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(* ) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
dotagames as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dg.winner <> 0
and dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h
order by totalscore desc
Now I'm not too sure what the best way to go is, but what, in your opinion, would be the best way to optimize this query?
I run a Q6600 @ 2.4GHz, 4GB of RAM, 64-bit Linux Ubuntu 9.04 system, and this query can take up to 6.7 seconds to run (I do have a huge database).
Also, I would like to paginate the results, and executing extra conditions on top of this query is far too slow...
I use Django as a frontend, so any methods that include using Python and/or Django would be great. MySQL and Apache2 tweaks are also welcome. And of course, I'm open to changing the query to make it run faster.
Thanks for reading my question; I look forward to reading your answers!
Edit: EXPLAIN QUERY RESULTS
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 783 Using filesort
2 DERIVED sc ALL name,name_2 NULL NULL NULL 2099 Using temporary; Using filesort
2 DERIVED gp ref gameid,colour,name name 17 development.sc.name 2
2 DERIVED ga eq_ref PRIMARY,id,id_2 PRIMARY 4 development.gp.gameid 1 Using index
2 DERIVED dg ref gameid,winner gameid 4 development.ga.id 1 Using where
2 DERIVED dp ref gameid_2,colour gameid_2 4 development.ga.id 10 Using where
First of all, the SQL is badly formatted. The most obvious error is the line splitting before each AS clause. The second obvious problem is using implicit joins instead of explicit INNER JOIN ... ON ....
Now to answer the actual question.
Without knowing the data or the environment, the first thing I'd look at would be some of the MySQL server settings, such as sort_buffer and key_buffer. If you haven't changed any of these, go read up on them. The defaults are extremely conservative and can often be raised more than ten times their default, particularly on the large iron like you have.
Having reviewed that, I'd run pieces of the query to check the speed and what EXPLAIN says. The effect of indexing can be profound, but MySQL has a "fingers-and-toes" problem where it just can't use more than one index per table. And JOINs with filtering can need two, so it has to descend to a row scan for the other check. That said, dicing up the query and trying different combinations will show you where it starts stumbling.
Now you will have an idea where a "tipping point" might be: this is where a small increase in some raw data size, like how much it needs to extract, will result in a big loss of performance as some internal structure gets too big. At this point, you will probably want to raise the temporary tables size. Beware that this kind of optimization is a bit of a black art. :-)
However, there is another approach: denormalization. In a simple implementation, regularly scheduled scripts will run this expensive query from time-to-time and poke the data into a separate table in a structure much closer to what you want to display. There are multiple variations of this approach. It can be possible to keep this up-to-date on-the-fly, either in the application, or using table triggers. At the other extreme, you could allow your application to run the expensive query occasionally, but cache the result for a little while. This is most effective if a lot of people will call it often: even 2 seconds cache on a request that is run 15 times a second will show a visible improvement.
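Since the asker uses Django as the frontend, the short-lived cache variant might look like this (a sketch; run_leaderboard_query is a hypothetical helper wrapping the expensive query above):

from django.core.cache import cache

def get_leaderboard():
    rows = cache.get('leaderboard')
    if rows is None:
        rows = run_leaderboard_query()  # hypothetical: runs the expensive SQL
        cache.set('leaderboard', rows, 2)  # keep the result for 2 seconds
    return rows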
You could find ways of producing the same data by running half-a-dozen queries that each return some of the data, and post-processing the results. You could also run a version of your original query that returns more data (which is likely to be much faster because it does less filtering) and post-process that. I have found several times that five simpler, smaller queries can be much faster - an order of magnitude, sometimes two - than one big query that is trying to do it all.
No index will help you since you are scanning entire tables.
As your database grows the query will always get slower.
Consider accumulating the stats: after every game, insert the row for that game, and also increment counters in the player's row. Then you don't need COUNT() and SUM(), because the information is already available.
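A sketch of that idea in Python, where player_stats is a hypothetical summary table with one row per player (averages such as kills per game can later be derived from the running totals and totgames):

def record_game_result(cursor, name, kills, deaths, won):
    # bump the per-player counters for one finished game
    cursor.execute(
        """
        UPDATE player_stats
           SET totgames = totgames + 1,
               wins     = wins + %s,
               kills    = kills + %s,
               deaths   = deaths + %s
         WHERE name = %s
        """,
        (1 if won else 0, kills, deaths, name),
    )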
select * is bad most of the time - select only the columns you need
break the select into multiple simple selects, and use temporary tables when needed
the SUM(CASE ...) part could be done with a subselect
MySQL has very bad performance with OR expressions; use two selects that you UNION together
Small Improvement
select *,
(kills / deaths) as killdeathratio,
(totgames - wins) as losses from (select gp.name as name,
gp.gameid as gameid,
gp.colour as colour,
Avg(dp.courierkills) as courierkills,
Avg(dp.raxkills) as raxkills,
Avg(dp.towerkills) as towerkills,
Avg(dp.assists) as assists,
Avg(dp.creepdenies) as creepdenies,
Avg(dp.creepkills) as creepkills,
Avg(dp.neutralkills) as neutralkills,
Avg(dp.deaths) as deaths,
Avg(dp.kills) as kills,
sc.score as totalscore,
Count(1) as totgames,
Sum(case
when ((dg.winner = 1 and dp.newcolour < 6) or
(dg.winner = 2 and dp.newcolour > 6))
then 1
else 0
end) as wins
from gameplayers as gp,
( select * from dotagames dg1 where dg1.winner <> 0 ) as dg,
games as ga,
dotaplayers as dp,
scores as sc
where dp.gameid = gp.gameid
and dg.gameid = dp.gameid
and dp.gameid = ga.id
and gp.gameid = dg.gameid
and gp.colour = dp.colour
and sc.name = gp.name
group by gp.name
having totgames >= 30
) as h order by totalscore desc
Changes:
1. COUNT(*) changed to COUNT(1).
2. In the FROM clause, the number of rows is reduced by filtering dotagames early.
