I am attempting to make an UPDATE query run faster in Postgres. The query is relatively simple, and I have broken it up into id ranges so I can spread the work across all of the CPUs on my database server.
UPDATE p797.line a SET p = 5 FROM p797.pt b WHERE a.source = b.node AND a.id >= 0 AND a.id < 40000000
where "0" and "40000000" are replaced with different values as you move through all the rows in the table. The "line" table has 1.3 billion records and the pt table has 500 million.
Right now this process runs in about 16 hours. I have other update queries that I need to perform and if each takes 16 hours, the results will take weeks to acquire.
I found something interesting that I would like to try, but am unsure whether it can be implemented in my case, as I am running queries over a network.
Slow simple update query on PostgreSQL database with 3 million rows
Here, Le Droid refers to COPY, a method which I believe I cannot employ as I am running over a network. They also use BUFFER, which I do not understand how to employ. Also, both of my tables reside in the same database, rather than the combination of a database table and a CSV file used there. How can I massage my query to get the gains that @Le Droid mentions? Is there another methodology I can employ to see time gains? I did see @Le Droid mention that HOT only sees marginal gains at a large cost. Other methods?
It might also be noteworthy that I am creating the queries in Python and sending them to the Postgres database using psycopg2.
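For reference, the batching from Python looks roughly like the sketch below; the connection string, pool size, and constants are placeholders rather than my real values.

# Minimal sketch of how I split the UPDATE into id ranges and run them
# in parallel with psycopg2 (DSN, pool size, and constants are placeholders).
from multiprocessing import Pool

import psycopg2

BATCH = 40_000_000
MAX_ID = 1_300_000_000

def run_batch(lo):
    conn = psycopg2.connect("dbname=mydb host=dbserver user=me")  # placeholder DSN
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE p797.line a SET p = 5 "
            "FROM p797.pt b "
            "WHERE a.source = b.node AND a.id >= %s AND a.id < %s",
            (lo, lo + BATCH),
        )
    conn.close()

if __name__ == "__main__":
    with Pool(processes=8) as pool:  # roughly one worker per CPU
        pool.map(run_batch, range(0, MAX_ID, BATCH))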
EDIT:
Here is an EXPLAIN on the above statement without the ID limitation:
"Update on line a (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
" -> Hash Join (cost=10665536.12..342338721.96 rows=1381265438 width=116)"
" Hash Cond: (a.source= b.node)"
" -> Seq Scan on line a (cost=0.00..52953645.38 rows=1381265438 width=102)"
" -> Hash (cost=8347277.72..8347277.72 rows=126271072 width=22)"
" -> Seq Scan on pt b (cost=0.00..8347277.72 rows=126271072 width=22)"
Frankly I'd extract the data, apply all transformations outside the database, then reload it. So an ETL, but with the E and the L being the same table. The transactional guarantees the database provides do not come cheap, and if I didn't need them I wouldn't want to pay that price in this situation.
Dropping all indexes from the table you are updating gives a tremendous performance boost for updates, on the order of 100x faster, even if the indexes are not related to the columns you are updating or joining on.
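If you go that route, the flow from Python might look something like the sketch below. It is only a sketch: the DSN is a placeholder, and indexes that back constraints (such as the primary key) cannot be dropped this way.

# Hedged sketch: record the table's index definitions, drop them, run the
# bulk UPDATE batches, then recreate them. DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=mydb host=dbserver user=me")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# Remember the indexes currently defined on the table so they can be rebuilt.
cur.execute("SELECT indexname, indexdef FROM pg_indexes "
            "WHERE schemaname = 'p797' AND tablename = 'line'")
index_defs = cur.fetchall()

# Drop them; note this errors for indexes backing constraints (e.g. the primary key).
for name, _ in index_defs:
    cur.execute('DROP INDEX IF EXISTS p797."{}"'.format(name))

# ... run the batched UPDATE statements here ...

# Recreate the indexes from their original definitions.
for _, definition in index_defs:
    cur.execute(definition)

cur.close()
conn.close()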
Related
I'm trying to store some measurement data in my PostgreSQL DB using Python and Django.
So far, so good: I've made a Docker container with Django, and another one with the PostgreSQL server.
However, I am getting close to 2M rows in my measurement table, and queries are starting to get really slow, while I'm not really sure why; I'm not doing very intense queries.
This query
SELECT ••• FROM "measurement" WHERE "measurement"."device_id" = 26 ORDER BY "measurement"."measure_timestamp" DESC LIMIT 20
for example takes between 3 and 5 seconds to run, depending on which device I query.
I would expect this to run a lot faster, since I'm not doing anything fancy.
The measurement table:
id INTEGER
measure_timestamp TIMESTAMP WITH TIME ZONE
sensor_height INTEGER
device_id INTEGER
with indices on id and measure_timestamp.
The server doesn't look too busy; even though it has only 512M of memory, I have plenty left during queries.
I configured the PostgreSQL server with shared_buffers=256MB and work_mem=128MB.
The total database is just under 100MB, so it should easily fit in memory.
If I run the query in pgAdmin, I see a lot of block I/O, so I suspect it has to read from disk, which is obviously slow.
Could anyone give me a few pointers in the right direction to find the issue?
EDIT:
Added the output of EXPLAIN ANALYZE on a query. I have now added an index on device_id, which helped a lot, but I would still expect quicker query times.
https://pastebin.com/H30JSuWa
Do you have indexes on measure_timestamp and device_id? If the queries always take that form, you might also benefit from a multi-column index.
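For example, assuming a plain Django model along the lines of the table in the question (in reality device_id is probably a ForeignKey, and the index name here is made up), a composite index matching the WHERE and ORDER BY could be declared like this:

# Hedged sketch: composite index covering the filter and the sort of the
# slow query. Model and index names are assumptions based on the question.
from django.db import models

class Measurement(models.Model):
    measure_timestamp = models.DateTimeField()
    sensor_height = models.IntegerField()
    device_id = models.IntegerField()  # probably a ForeignKey in the real model

    class Meta:
        indexes = [
            # matches: WHERE device_id = ? ORDER BY measure_timestamp DESC LIMIT 20
            models.Index(
                fields=["device_id", "-measure_timestamp"],
                name="meas_device_ts_desc_idx",
            ),
        ]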
Please look at the distribution key of your table. It is possible that the data is sparsely distributed, which hurts performance. Selecting a proper distribution key is very important when you have 2M records. For more details, read this on why the distribution key is important.
I'm trying to think of an algorithm to solve this problem I have. It's not a HW problem, but for a side project I'm working on.
There's a table A that has on the order of 10^5 rows and adds new rows on the order of 10^2 every day.
Table B has on the order of 10^6 rows and adds around 10^3 new rows every day. There's a one-to-many relation from A to B (many B rows for each row in A).
I was wondering how I could do continuous aggregates for this kind of data. I would like a job that runs every ~10 minutes and does this: for every row in A, find every row in B related to it that was created in the last day, week, and month (and then sort by count), and save the results in a different DB or cache them.
If this is confusing, here's a practical example: say table A has Amazon products and table B has product reviews. We would like to show a sorted list of products with the most reviews in the last 4 hours, day, week, etc. New products and reviews are added at a fast pace, and we'd like said list to be as up-to-date as possible.
The current implementation I have is just a for loop (pseudo-code):
result = {}
for product in db_products:
    # fetch this product's reviews created after the cutoff
    # (db_reviews is a placeholder query helper)
    reviews = db_reviews(product_id=product.id, created_after=some_time)
    result[product.id] = {
        "reviews": reviews,
        "reviews_count": len(reviews),
    }
# sort products by review count, highest first
result = sorted(result.items(), key=lambda item: item[1]["reviews_count"], reverse=True)
return result
I do this every hour and save the result to a JSON file to serve. The problem is that this doesn't really scale well and takes a long time to compute.
So, where could I look to solve this problem?
UPDATE:
Thank you for your answers. But I ended up learning and using Apache Storm.
Summary of requirements
You have two fairly big tables in a database, and you regularly need to create aggregates for past time periods (hour, day, week, etc.) and store the results in another database.
I will assume that once a time period has passed, there are no changes to the related records; in other words, the aggregate for a past period always yields the same result.
Proposed solution: Luigi
Luigi is a framework for plumbing dependent tasks, and one of its typical uses is calculating aggregates for past periods.
The concept is as follows:
You write a simple Task class, which defines the required input data, the output data (called a Target), and the process that creates the target output.
Tasks can be parametrized; a typical parameter is a time period (a specific day, hour, week, etc.).
Luigi can be stopped in the middle and started again later. It considers any task whose target already exists to be completed and will not rerun it (you would have to delete the target content to let it rerun).
In short: if the target exists, the task is done.
This works for multiple types of targets, such as files in the local file system, on Hadoop, on AWS S3, and also in a database.
To prevent half-done results, target implementations take care of atomicity, so e.g. files are first created in a temporary location and are moved to their final destination only after they are complete.
In databases, there are structures to denote that a database import has been completed.
You are free to create your own target implementations (they have to create something and provide an exists method to check whether the result exists); a tiny sketch follows below.
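For illustration only, a custom target might look roughly like this; the marker table and the touch helper are my own invention, and the only contract Luigi requires is the exists method:

# Hedged sketch of a custom Luigi Target. Only exists() is required by Luigi;
# the marker table and the touch() helper are hypothetical.
import luigi

class DbMarkerTarget(luigi.Target):
    def __init__(self, conn, period):
        self.conn = conn      # an open DB-API connection
        self.period = period  # e.g. a date identifying the aggregated period

    def exists(self):
        # the task is considered done when a marker row for this period exists
        cur = self.conn.cursor()
        cur.execute("SELECT 1 FROM aggregate_markers WHERE period = %s", (self.period,))
        return cur.fetchone() is not None

    def touch(self):
        # to be called by the task once its results are fully written
        cur = self.conn.cursor()
        cur.execute("INSERT INTO aggregate_markers (period) VALUES (%s)", (self.period,))
        self.conn.commit()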
Using Luigi for your task
For the task you describe, you will probably find everything you need already present. Just a few tips:
The class luigi.postgres.CopyToTable allows storing records into a Postgres database. The target will automatically create a so-called "marker table" where it marks all completed tasks; see the sketch after this list.
There are similar classes for other types of databases; one of them uses SQLAlchemy, which should cover the database you use, see the class luigi.contrib.sqla.CopyToTable.
The Luigi documentation includes a working example of importing data into an SQLite database.
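For illustration, a minimal sketch of such a task might look roughly like the following; all names, credentials, and the compute_counts helper are placeholders, and the exact import path (luigi.postgres vs. luigi.contrib.postgres) depends on your Luigi version:

# Hedged sketch: a Luigi task writing per-product review counts into Postgres.
# Connection values, table, and compute_counts() are placeholders.
import luigi
from luigi.contrib.postgres import CopyToTable

class ReviewCountsToPostgres(CopyToTable):
    date = luigi.DateParameter()  # the past period this aggregate covers

    host = "localhost"
    database = "analytics"
    user = "luigi"
    password = "secret"
    table = "review_counts"

    columns = [
        ("product_id", "INT"),
        ("day", "DATE"),
        ("review_count", "INT"),
    ]

    def rows(self):
        # yield one tuple per row to copy; compute_counts() stands in for the
        # aggregation query against the source database
        for product_id, count in compute_counts(self.date):
            yield (product_id, self.date, count)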
A complete implementation is beyond what is feasible in a StackOverflow answer, but I am sure you will experience the following:
The code to do the task is really clear: no boilerplate coding, you write only what has to be done.
Nice support for working with time periods, even from the command line; see e.g. Efficiently triggering recurring tasks. It even takes care of not going too far into the past, to prevent generating too many tasks and possibly overloading your servers (the default values are set very reasonably and can be changed).
The option to run the task on multiple servers (using the central scheduler, which is provided with the Luigi implementation).
I have processed huge amounts of XML files with Luigi and have also written tasks that import aggregated data into a database, and I can recommend it (I am not the author of Luigi, just a happy user).
Speeding up database operations (queries)
If your task suffers from a database query that takes too long to execute, you have a few options:
If you are counting reviews per product in Python, consider trying an SQL query instead; it is often much faster. It should be possible to write an SQL query which counts the proper records and returns the number you need directly. With GROUP BY you can even get the summary information for all products in one run, as in the sketch after this list.
Set up a proper index, probably on the "reviews" table on the "product" and "time period" columns. This should speed up the query, but make sure it does not slow down inserting new records too much (too many indexes can cause that).
It might happen that with an optimized SQL query you get a working solution even without using Luigi.
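A hedged sketch of both points, with table and column names assumed from the question (adjust to your schema and database):

# Hedged sketch: let the database count reviews per product in one pass,
# supported by an index on (product_id, created). Names are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=products host=localhost user=app")  # placeholder DSN
cur = conn.cursor()

# index serving "reviews for product X created after T"
cur.execute("CREATE INDEX IF NOT EXISTS reviews_product_created_idx "
            "ON reviews (product_id, created)")

# one query returns the per-product counts for the chosen window, already sorted
cur.execute("""
    SELECT product_id, COUNT(*) AS review_count
    FROM reviews
    WHERE created >= NOW() - INTERVAL '1 day'
    GROUP BY product_id
    ORDER BY review_count DESC
""")
top_products = cur.fetchall()

conn.commit()
cur.close()
conn.close()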
Data Warehousing? Summary tables are the right way to go.
Does the data change (once it is written)? If it does, then incrementally updating Summary Tables becomes a challenge. Most DW applications do not have that problem.
Update the summary table (day + dimension(s) + count(s) + sum(s)) as you insert into the raw data table(s). Since you are getting only one insert per minute, INSERT INTO SummaryTable ... ON DUPLICATE KEY UPDATE ... would be quite adequate, and simpler than running a script every 10 minutes.
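For instance, a hedged sketch of that pattern in MySQL syntax; the summary table, its columns, and the unique key on (product_id, day) are made up for illustration:

# Hedged sketch: keep the summary row current at insert time.
# Assumes review_summary has a unique key on (product_id, day).
summary_upsert = """
    INSERT INTO review_summary (product_id, day, review_count)
    VALUES (%s, CURDATE(), 1)
    ON DUPLICATE KEY UPDATE review_count = review_count + 1
"""

def record_review(cursor, product_id):
    # call this right after inserting the raw review row
    cursor.execute(summary_upsert, (product_id,))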
Do any reporting from a summary table, not the raw data (the Fact table). It will be a lot faster.
My Blog on Summary Tables discusses details. (It is aimed at bigger DW applications, but should be useful reading.)
I agree with Rick, summary tables make the most sense for you. Update the summary tables every 10 minutes and just pull data from them as users request summaries.
Also, make sure that your DB is indexed properly for performance. I'm sure db_products.id is set as a unique index, but also make sure that db_products.create is defined as a DATE or DATETIME and is indexed as well, since you are using it in your WHERE clause.
I've just set up a delta-loading data flow between multiple MySQL DBs and a Postgres DB. It only copies tens of MBs every 15 minutes.
Yet, I'd like to set up a process to fully load the data between them in case of emergency...
Python just crashes and doesn't seem fast enough when using SQLAlchemy, etc.
I've read that the best approach might be to just dump everything from MySQL into CSV and then use file_fdw to load the entire tables into Postgres.
Has anyone faced a similar issue? If yes, how did you proceed?
Long story short, ORM overhead is killing your performance.
When you're not manipulating the objects involved, it's better to use SQLAlchemy Core expressions ("SQL Expressions"), which are almost as fast as pure SQL.
Solution:
Of course, I'm presuming your MySQL and Postgres models have been meticulously synchronized (i.e. values from an object from MySQL are not a problem for creating an object in the Postgres model, and vice versa).
Overview:
get Table objects out of declarative classes
select (SQLAlchemy Expression) from one database
convert rows to dicts
insert into the other database
More or less:
# imports needed for the Core expressions below
from sqlalchemy import select, and_

# get tables
m_table = ItemMySQL.__table__
pg_table = ItemPG.__table__

# SQL Expression that gets a range of rows quickly
pg_q = select([pg_table]).where(
    and_(
        pg_table.c.id >= id_start,
        pg_table.c.id <= id_end,
    ))

# get PG DB rows
eng_pg = DBSessionPG.get_bind()
conn_pg = eng_pg.connect()
result = conn_pg.execute(pg_q)
rows_pg = result.fetchall()

# get a MySQL connection to write into
# (DBSessionMySQL is assumed to mirror DBSessionPG on the MySQL side)
eng_m = DBSessionMySQL.get_bind()
conn_m = eng_m.connect()

for row_pg in rows_pg:
    # convert PG row object into dict
    value_d = dict(row_pg)
    # insert into MySQL (the statement must actually be executed)
    conn_m.execute(m_table.insert().values(**value_d))

# close row proxy object and connections, else suffer leaks
result.close()
conn_pg.close()
conn_m.close()
For background on performance, see the accepted answer (by the SQLAlchemy principal author himself):
Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?
Since you seem to have Python crashing, perhaps you're using too much memory? Hence I suggest reading and writing rows in batches.
A further improvement could be using .values to insert a number of rows in one call, see here: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#inserts-and-updates
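For example, a hedged sketch of batching the copy loop above (conn_m being the MySQL connection from the snippet, batch size arbitrary):

# Hedged sketch: fetch and insert in batches instead of row by row.
BATCH_SIZE = 1000

result = conn_pg.execute(pg_q)
while True:
    rows = result.fetchmany(BATCH_SIZE)
    if not rows:
        break
    # executemany-style insert: one call per batch of row dicts
    conn_m.execute(m_table.insert(), [dict(row) for row in rows])
result.close()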
I am trying to perform some n-gram counting in python and I thought I could use MySQL (MySQLdb module) for organizing my text data.
I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).
select * from table is too slow and uses a devastating amount of memory.
I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one with queries like:
select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 2999
and so on...
Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?
I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.
You can use the HANDLER statement to traverse a table (or index) in chunks. This is not very portable, and it interacts in an "interesting" way with transactions if rows appear and disappear while you're looking at it (hint: you're not going to get consistency), but it makes the code simpler for some applications.
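A hedged sketch of the idea with MySQLdb; credentials, the table name, and the process() helper are placeholders:

# Hedged sketch: walk the documents table in fixed-size chunks via HANDLER
# (MySQL-specific). Credentials and process() are placeholders.
import MySQLdb

conn = MySQLdb.connect(db="corpus", user="me", passwd="secret")  # placeholder credentials
cur = conn.cursor()

cur.execute("HANDLER documents OPEN")
cur.execute("HANDLER documents READ `PRIMARY` FIRST LIMIT 2000")
rows = cur.fetchall()
while rows:
    process(rows)  # placeholder for the n-gram counting
    cur.execute("HANDLER documents READ `PRIMARY` NEXT LIMIT 2000")
    rows = cur.fetchall()
cur.execute("HANDLER documents CLOSE")

cur.close()
conn.close()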
In general, you are going to take a performance hit: even if your database server is local to the machine, several copies of the data will be necessary (in memory), as well as some other processing. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.
Aside from having indexes defined on whatever columns you're using to filter the query (language and ID probably, where ID already has an index care of the primary key), no.
First: you should avoid using * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all of this in a database, especially if you are storing file names. You could use an XML format, for example (and read/write it with a SAX API).
If you want a DB and something faster than MySQL, you can consider an in-memory database such as SQLite or BerkeleyDB, both of which have Python bindings.
The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below): It looks like setFetchSize is the key to solving this. In my case, I execute the query from Python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help; see the last bullet item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 DBAPI driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs, and if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
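In the meantime, dropping down to a raw psycopg2 named cursor gives the same effect; a minimal sketch, with the DSN and the per-row processing as placeholders:

# Hedged sketch: a psycopg2 "named" cursor is a server-side cursor, so rows
# stream to the client in batches instead of being buffered all at once.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor(name="stream_data")           # giving it a name makes it server-side
cur.itersize = 2000                             # rows fetched per network round trip

cur.execute("SELECT time, value FROM data ORDER BY time")
for time_, value in cur:
    handle(time_, value)                        # placeholder per-row processing

cur.close()
conn.close()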
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PostgreSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.