Using pyodbc, I wrote a Python program to extract data from Oracle and load it into SQL Server. The extraction from Oracle is instant, but some tables take a very long time to load, especially the tables with many columns (100+), a few of which are VARCHAR(4000) (I am running pyodbc's executemany for the INSERT).
Turning on fast_executemany = True seems to make the INSERT even slower. With it turned off, loading a table of 40k rows took about 3 minutes; with it turned on, loading the same number of rows took about 15 minutes.
Not sure if this means anything, but I did turn on SQL Profiler during each try and here is what I found: when fast_executemany is off, the backend does a bunch of "sp_prepexec" and "sp_unprepare" calls, one pair per insert; when it is on, the backend does a single "sp_prepare" and then a bunch of "sp_execute" calls.
Any idea why fast_executemany is not speeding up the INSERT, and is in fact making it take much longer?
Update: I was able to resolve my problem by limiting how many rows get inserted at a time. I set the batch size for each INSERT operation to 1000 rows, and now the same INSERT of 40k rows takes about 40 seconds, compared to the 15 minutes it took without a batch size.
I am guessing fast_executemany puts everything into memory before executing the INSERT, so if the column sizes are huge and there are many rows per operation, it puts a lot of burden on memory and hence gets much slower (?).
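Roughly, the batching looks like this (a sketch only; the DSN, table, and column names here are placeholders, not my real ones):

    import pyodbc

    def insert_in_batches(rows, batch_size=1000):
        # rows: list of tuples already extracted from Oracle
        conn = pyodbc.connect("DSN=TargetSqlServer")  # placeholder DSN
        cursor = conn.cursor()
        cursor.fast_executemany = True
        sql = "INSERT INTO my_table (col1, col2, col3) VALUES (?, ?, ?)"  # placeholder table/columns
        for start in range(0, len(rows), batch_size):
            # send at most batch_size parameter sets per executemany call
            cursor.executemany(sql, rows[start:start + batch_size])
        conn.commit()
        cursor.close()
        conn.close()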
I'm working on a pathfinding project that uses topographic data of huge areas.
In order to reduce the huge memory load, my plan is to pre-process the map data by creating nodes that are saved in a Postgres DB on start-up, and then accessed as needed by the algorithm.
I've created 3 Docker containers for that: the Postgres DB, Adminer, and my Python app.
It works as expected with small amounts of data, so communication between the containers and within the application isn't the problem.
The way it works is that you give it a 2D array; it takes the first row, converts each element into a node and saves it in the DB using psycopg2.extras.execute_values, before going on to the second row, then the third...
Once all nodes are registered, it updates each of them by searching for their neighbors and adding their ids in the right columns. That way pre-processing the data takes longer, but I get easier access when running the algorithm.
However, I think the DB has trouble processing the data past a certain point. The map I gave it comes from a 9600x14400 .tif file, and even when ignoring useless/invalid data, that amounts to more than 10 million nodes.
Basically, it worked quite slowly but okay until around 90% of the node creation process, where the data stopped being processed. Both the Python and Postgres containers were still running and responsive, but no more nodes were being created, and the neighbor-linking part of the pre-processing didn't start either.
Also, there were no error messages on either side.
I've read that the row limit in a Postgres table is absurdly high, but also that a table becomes really slow once a lot of elements are in it. So could it be that it didn't crash or freeze, but is just taking an insane amount of time to complete the remaining node creation requests?
Would reducing the batch size even more help in that regard?
Or would maybe splitting the table into multiple smaller ones be better?
The queries and the psycopg functions I used were not optimized for the mass inserts and updates I was doing.
The changes I made were:
Reducing the batch size from 14k to 1k
Making larger SELECT queries instead of many smaller ones
Creating indexes on important columns
Changing a normal UPDATE query to an UPDATE ... FROM, using execute_values instead of cursor.execute (sketched below)
These changes brought the execution time down from an estimated 5.5 days to around 8 hours.
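For reference, a rough sketch of the UPDATE ... FROM with execute_values pattern from the last change listed above (the nodes table and its columns are placeholder names, not my actual schema):

    from psycopg2.extras import execute_values

    def link_neighbors(conn, pairs, page_size=1000):
        # pairs: list of (node_id, neighbor_id) tuples computed in Python
        # depending on the column types, the VALUES entries may need explicit casts
        sql = """
            UPDATE nodes AS n
            SET neighbor_id = v.neighbor_id
            FROM (VALUES %s) AS v(node_id, neighbor_id)
            WHERE n.id = v.node_id
        """
        with conn.cursor() as cur:
            # execute_values expands %s into a multi-row VALUES list,
            # sending page_size rows per statement
            execute_values(cur, sql, pairs, page_size=page_size)
        conn.commit()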
Working in PostgreSQL, I have a cartesian join producing ~4 million rows.
The join takes ~5 sec and the write back to the DB takes ~1 min 45 sec.
The data will be required for use in Python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in Python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) consistently takes under 3 secs, which is impressive.
Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected as I had to stop them after >1 hr).
While I was not expecting to reproduce the "sql only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 mins would not be unreasonable).
Slower methods include:
36 min - sqlalchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a CSV first (in memory via io.StringIO); that alone takes 2 mins.
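For anyone curious, the copy_from version looks roughly like this (a sketch; it assumes the dataframe's columns line up with the target table and that no values contain the separator):

    import io

    def copy_df(conn, df, table_name):
        # dump the dataframe to an in-memory CSV (no index, no header)
        buf = io.StringIO()
        df.to_csv(buf, index=False, header=False)
        buf.seek(0)
        with conn.cursor() as cur:
            # copy_from defaults to tab-separated input, so pass sep=','
            cur.copy_from(buf, table_name, sep=',', columns=list(df.columns))
        conn.commit()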
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now take a custom callable for to_sql's method argument, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it copies CSV data directly from STDIN using StringIO).
I found a ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
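For completeness, the callable is essentially the recipe from those pandas docs, condensed here (treat it as a sketch rather than a drop-in; the table name in the usage line is a placeholder):

    import csv
    import io

    def psql_insert_copy(table, conn, keys, data_iter):
        # signature required by DataFrame.to_sql(method=...)
        dbapi_conn = conn.connection  # raw psycopg2 connection from SQLAlchemy
        with dbapi_conn.cursor() as cur:
            buf = io.StringIO()
            csv.writer(buf).writerows(data_iter)
            buf.seek(0)
            columns = ', '.join('"{}"'.format(k) for k in keys)
            name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
            cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(name, columns), buf)

    # usage: df.to_sql('target_table', engine, index=False, method=psql_insert_copy)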
Answering Q1 myself:
It seems the issue had more to do with PostgreSQL (or rather, databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting the rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer. (A small sketch of this follows after point 3.)
3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
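The sketch mentioned in point 2 is just a sort before the write; 'key_a' and 'key_b' stand in for whatever the primary key columns actually are, df is the dataframe, engine is a SQLAlchemy engine, and psql_insert_copy is the COPY-based callable sketched above:

    # sort by the primary key columns so rows arrive in index order
    df = df.sort_values(['key_a', 'key_b'])
    df.to_sql('target_table', engine, index=False, method=psql_insert_copy)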
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles
I've got 500K rows I want to insert into PostgreSQL using SQLAlchemy.
For speed, I'm inserting them using session.bulk_insert_mappings().
Normally, I'd break up the insert into smaller batches to minimize session bookkeeping. However, bulk_insert_mappings() uses dicts and bypasses a lot of the traditional session bookkeeping.
Will I still see a speed improvement if I break the insert up into smaller discrete batches, say doing an insert every 10K rows?
If so, should I close the PG transaction after every 10K rows, or leave it open the whole time?
In my experience, you'll see substantial performance improvements if you use INSERT INTO tbl (column1, column2) VALUES (...), (...), ...; as opposed to bulk_insert_mappings, which uses executemany. In this case you'll want to batch the rows at least on a statement level for sanity.
SQLAlchemy supports generating a multi-row VALUES clause for a single INSERT statement, so you don't have to hand-craft the statement.
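A rough sketch of what that can look like with a Core Table (the connection URL, table, and columns here are made up, and it assumes SQLAlchemy 1.4+ for autoload_with):

    from sqlalchemy import MetaData, Table, create_engine

    engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL
    my_table = Table("my_table", MetaData(), autoload_with=engine)   # placeholder table

    rows = [{"col1": i, "col2": "x"} for i in range(500000)]  # the same dicts you'd give bulk_insert_mappings
    batch_size = 10000

    with engine.begin() as conn:
        for start in range(0, len(rows), batch_size):
            # a list passed to .values() renders a single INSERT ... VALUES (...), (...), ...
            conn.execute(my_table.insert().values(rows[start:start + batch_size]))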
Committing between batches probably won't have much of an effect on the performance, but the reason to do it would be to not keep an open transaction for too long, which could impact other transactions running on the server.
You can also experiment with using COPY to load it into a temporary table, then INSERTing from that table.
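If you go the COPY route, the shape of it with raw psycopg2 would be something like this (table names are placeholders, and the CSV buffer is assumed to match the target table's columns):

    def copy_then_insert(conn, csv_buffer):
        # csv_buffer: an io.StringIO (or file) holding the rows as CSV
        with conn.cursor() as cur:
            cur.execute("CREATE TEMP TABLE staging (LIKE target_table INCLUDING DEFAULTS)")
            cur.copy_expert("COPY staging FROM STDIN WITH CSV", csv_buffer)
            cur.execute("INSERT INTO target_table SELECT * FROM staging")
        conn.commit()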
I have a large SQLite db where I am joining a 3.5M-row table onto itself. I use SQLite since it is the serialization format of my python3 application and the flatfile format is important in my workflow. When iterating over the rows of this join (around 55M rows) using:
    cursor.execute('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
                   'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
    for row in cursor:
        pass  # do stuff with row
EXPLAIN QUERY PLAN gives the following:
0|0|0|SCAN TABLE proteins AS p USING INDEX pid_index (~1000000 rows)
0|1|1|SEARCH TABLE proteins AS pp USING INDEX pname_index (pname=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
sqlite3 errors with "database or disk is full" after, say, 1,000,000 rows, which seems to indicate a full SQLite on-disk temp store. Since I have enough RAM on my current box, that can be solved by setting the temp store to memory, but it's suboptimal because in that case all the RAM seems to get used up, and I tend to run 4 or so of these processes in parallel. My (probably incorrect) assumption was that the iterator was a generator and would not put a large load on memory, unlike e.g. fetchall, which loads all rows. However, I now run out of disk space (on a small SSD scratch disk), so I assume SQLite needs to store the results somewhere.
A way around this may be to run chunks of SELECT ... LIMIT x OFFSET y queries, but they get slower each time a bigger OFFSET is used. Is there any other way to run this? What is stored in these temporary files? They seem to grow the further I iterate.
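One alternative to OFFSET I've thought about (a sketch only, assuming pid is a positive numeric column): split the query on pid ranges instead of row offsets, so no rows have to be skipped over and each chunk only has to de-duplicate its own slice. Since chunks never share a p.pid value, per-chunk DISTINCT should give the same rows overall:

    cursor.execute('SELECT MAX(pid) FROM proteins')
    (max_pid,) = cursor.fetchone()

    step = 50000  # tune so each chunk's temp B-tree stays small
    lo = 0
    while lo <= max_pid:
        cursor.execute(
            'SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
            'AS p JOIN proteins AS pp USING(pname) '
            'WHERE p.pid > ? AND p.pid <= ? ORDER BY p.pid',
            (lo, lo + step))
        for row in cursor:
            pass  # do stuff with row
        lo += step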
0|0|0|USE TEMP B-TREE FOR DISTINCT
Here's what's using the disk.
In order to support DISTINCT, SQLite has to keep track of which rows have already appeared in the results. For a large number of results, this set can grow huge, so to save RAM, SQLite temporarily stores the distinct set on disk.
Removing the DISTINCT clause is an easy way to avoid the issue, but it changes the meaning of the query; you can now get duplicate rows. If you don't mind that, or you have unique indices or some other way of ensuring that you never get duplicates, then that won't matter.
What you are trying to do with SQLite3 is a very bad idea, let me try to explain why.
You have the raw data on disk where it fits and is readable.
You generate a result inside of SQLite3 which expands greatly.
You then try to transfer this very large dataset through an sql connector.
Relational databases in general are not made for this kind of operation, and SQLite3 is no exception. They were made for small, quick queries that live for a fraction of a second and return a couple of rows.
You would be better off using another tool.
Reading the whole dataset into Python using pandas, for instance, is my recommended solution. Using itertools is also a good idea.
Sorry for my English in advance.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a Cassandra database locally, on one node. Each row has 10 columns, and I insert them into only one column family.
With one thread, that operation took around 3 min. But I would like to do the same operation with 2 million rows and keep a good time, so I tried with 2 threads to insert 2 million rows, expecting a similar result of around 3-4 min. But I got a result of about 7 min... twice the first result. As I've read on different forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data into a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have separated the commitlog directory and the data directory onto different disks
- I use a batch insert for each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the Python GIL, but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 mins is about 5500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount provided that you use multiple clients, probably inserting small batches of rows, or individual rows.
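Roughly what that looks like with pycassa (a sketch only; the keyspace, column family, and data below are placeholders, and the batch size and process count are worth experimenting with):

    from multiprocessing import Process
    import pycassa

    def worker(rows):
        # each process opens its own connection pool (keyspace/server are placeholders)
        pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
        # queue_size controls how many inserts are grouped into each batch mutation
        with cf.batch(queue_size=100) as b:
            for row_key, columns in rows:
                b.insert(row_key, columns)

    if __name__ == '__main__':
        # placeholder data: (row_key, {column: value}) pairs
        rows = [('key%d' % i, {'col1': 'value'}) for i in range(2000000)]
        n = 4  # number of client processes
        procs = [Process(target=worker, args=(rows[i::n],)) for i in range(n)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()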
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?