Poor performance writing a dataframe to PostgreSQL - Python

Working in PostgreSQL, I have a Cartesian join producing ~4 million rows.
The join takes ~5 sec and the write back to the DB takes ~1 min 45 sec.
The data will be needed in Python, specifically in a pandas dataframe, so I am experimenting with reproducing the same process in Python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (using an answer from here: cartesian product in pandas) consistently takes under 3 seconds, which is impressive.
Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (not counting some methods I rejected because I had to stop them after more than an hour).
While I was not expecting to reproduce the "sql only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 minutes would not be unreasonable).
Slower methods include:
36 min - SQLAlchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using SQLAlchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
The best method (~8 min) uses psycopg2's cursor.copy_from (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a CSV first (in memory via io.StringIO); that alone takes 2 minutes.
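For reference, the copy_from route looks roughly like this (a minimal sketch; the DSN and table name are placeholders, and `df` is the joined dataframe):

import io

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)  # the in-memory CSV dump (~2 min on its own)
buf.seek(0)

with conn.cursor() as cur:
    # copy_from uses PostgreSQL's text format, so values containing the
    # separator or quotes need extra care
    cur.copy_from(buf, 'destination_table', sep=',')
conn.commit()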
So, my questions:
Does anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to PostgreSQL?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to CSV. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now take a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it COPYs CSV data directly from STDIN via StringIO).
I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
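For anyone else landing here, the callable from the pandas docs looks roughly like this (lightly adapted; the table and engine names in the usage line are just examples):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    """to_sql `method` callable that COPYs CSV data from an in-memory buffer."""
    dbapi_conn = conn.connection  # the raw psycopg2 connection behind the SQLAlchemy one
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# usage (engine and table name are examples):
# df.to_sql('destination_table', engine, method=psql_insert_copy, index=False)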

Answering Q1 myself:
It seems the issue had more to do with PostgreSQL (or rather with databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my Cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer (see the sketch after this list).
3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, rebuilding the indexes seemed to use up any time gained.
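A rough sketch of the experiment for points 1 and 2 (the table, index and column names here are illustrative only, not my real schema):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
with conn.cursor() as cur:
    # 1) drop secondary indexes before the bulk insert...
    cur.execute('DROP INDEX IF EXISTS dest_some_idx;')
    # 2) ...insert pre-sorted by the primary key columns, so the PK index
    #    is filled in order with little or no page shuffling...
    cur.execute('INSERT INTO dest SELECT * FROM staging ORDER BY pk_col;')
    # ...then rebuild the secondary indexes afterwards
    cur.execute('CREATE INDEX dest_some_idx ON dest (some_col);')
conn.commit()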
I hope this helps anyone else who comes across this question from a search.
I still wonder whether it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles

Related

Why is the compute() method slow for Dask dataframes but the head() method is fast?

So I'm a newbie when it comes to working with big data.
I'm dealing with a 60GB CSV file, so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me; I just need a little push in the right direction...
So I understand why the following query using the "compute" method would be slow (lazy computation):
df.loc[1:5 ,'enaging_user_following_count'].compute()
By the way, it took 10 minutes to compute.
But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (in this case I'm asking for 250 rows, while the previous snippet asked for just 5):
df.loc[50:300 , 'enaging_user_following_count'].head(250)
Why doesn't the "head" method take a long time? I feel like I'm missing something here, because I can pull a huge number of rows in a much shorter time than with the "compute" method.
Or is the compute method used in other situations?
Note: I tried to read the documentation but there was no explanation of why head() is fast.
I played around with this a bit half a year ago. .head() does not check all your partitions; it simply checks the first partition. There is no synchronization overhead etc., so it is quite fast, but it does not take the whole dataset into account.
You could try
df.loc[-251: , 'enaging_user_following_count'].head(250)
IIRC you should get the last 250 entries of the first partition instead of the actual last indices.
Also if you try something like
df.loc[conditionThatIsOnlyFulfilledOnPartition3 , 'enaging_user_following_count'].head(250)
you get an error that head could not find 250 samples.
If you actually just want the first few entries, however, it is quite fast :)
This processes the whole dataset:
df.loc[1:5, 'enaging_user_following_count'].compute()
The reason is that loc is a label-based selector, and there is no telling which labels exist in which partition (there's no reason they should be monotonically increasing). If the index is well formed, you may have useful values of df.divisions, and in that case Dask should be able to pick only the partitions of your data that it needs.
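A small sketch of the difference (the file name and the sorted column in the last step are assumptions, not from your data):

import dask.dataframe as dd

df = dd.read_csv('big_file.csv')  # lazily split into many partitions

# head() only needs the first partition(s), so it returns almost immediately
df['enaging_user_following_count'].head(250)

# .loc[...].compute() has to scan every partition when the divisions are unknown
df.loc[1:5, 'enaging_user_following_count'].compute()

# with a sorted index and known divisions, Dask can prune partitions for .loc
df2 = df.set_index('timestamp', sorted=True)  # assumes such a sorted column exists
print(df2.divisions)  # the partition boundaries Dask uses for pruning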

Editing a big database in Excel - any easy-to-learn language that provides array manipulation other than VBA? Python? Which library?

I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA which stores two big arrays in memory and compares them by a single column in each (for example by names); the common items that appear in both arrays are then copied into another temporary array TOGETHER with the other entries in the same row. So if row(11)'s name was "Tom", and it is common to both arrays, and next to Tom was his salary of 10,000 and his phone number, the entire row would be copied.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns compared against another array of the same size, 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But then I tried this method with a much bigger array, say 200,000 rows x 10 columns, compared against a second array of 10,000 rows x 10 columns... and it took a lot of time.
The problem is that Excel only runs at 25% CPU - I checked online and that is normal.
Thus I am assuming that to get better performance I would need to use another 'tool', in this case another programming language.
I have heard that Python is great, Python is easy, etc., but I am no programmer; I just learned a few dozen object names and I know some logic, so I got by in VBA.
Is Python the answer? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes; I literally have no idea what is going on in that code.
If it is indeed Python, which libraries should I start with? Perhaps learn some easy things first and then move on to those arrays?
Thanks!
I have no intention of being condescending, but anything I say is going to sound condescending, so so be it.
The operation you are doing is called a join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing an NxM operation in Excel. A 200,000-row x 10,000-row operation quickly explodes: pick a key in N, search for a row in M, and produce the result. When you do this, regardless of the programming language, the amount of computation becomes so large that there is no way to finish the task in a reasonable amount of time.
In this case, 200,000 rows x 10,000 rows requires about 5,000 comparisons per row on average, for each of the 200,000 rows. That's 1,000,000,000 operations.
So how do real databases do this in a reasonable amount of time? They use an index. When you look into this 10,000-row table, what you are looking for is indexed, so searching for a row becomes log2(10,000). The total order of computation becomes N * log2(M), which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant - so the computation order becomes N.
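A toy Python illustration of the difference (the data and column layout here are made up):

# the "small" table: name, salary, phone  (~10,000 rows in your real case)
small = [("Tom", 10000, "555-0100"), ("Ann", 12000, "555-0101")]
# the "big" table: name plus other columns (~200,000 rows in your real case)
big = [("Tom", "dept A"), ("Bob", "dept B")]

# build the hash "index" once: O(M)
by_name = {row[0]: row for row in small}

# each probe is now roughly O(1), so the whole join is ~O(N + M)
# instead of the O(N * M) of a nested full scan
joined = [big_row + by_name[big_row[0]][1:]
          for big_row in big if big_row[0] in by_name]
print(joined)  # [('Tom', 'dept A', 10000, '555-0100')]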
What you are doing is probably, in real database terms, a full table scan. It is something to avoid in a real database because it is slow.
If you use any real (SQL) database, or a programming language that provides key-based search over a dataset, your join will become really fast. It has nothing to do with any particular programming language; it is really Computer Science 101.
I don't know much about what Excel can do, but if it provides some facility to look up a row based on indexing or hashing, you may be able to speed things up drastically.
Ideally you want to design a database (there are many, such as SQLite, PostgreSQL, MySQL, etc.) and stick your data into it. SQL is the language for talking to a database (DML, data manipulation language) and for creating/editing the structure of the database (DDL, data definition language).
Why a database? You'll get data validation and the ability to query data with many relationships (such as one-to-many, e.g. one author can have many books, but you'll have an Author table and a Book table and will need to join them).
Pandas works not just with databases but also with CSV and text files, Microsoft Excel and HDF5, and is great for reading and writing these, as well as merging, joining and slicing the data in memory. The quickest route to what you want is likely to read your data into pandas dataframes and then manipulate it from there. This makes a database optional, though still recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
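A minimal pandas sketch of the same "match by name" operation (the file, sheet and column names are assumptions, not from your workbook):

import pandas as pd

big = pd.read_excel('data.xlsx', sheet_name='Big')      # ~200,000 rows
small = pd.read_excel('data.xlsx', sheet_name='Small')  # ~10,000 rows

# inner join on the shared 'Name' column: keeps only names present in both,
# together with all the other columns from each sheet
common = big.merge(small, on='Name', how='inner')

common.to_excel('common_rows.xlsx', index=False)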
Another Python tool you could use is SQLAlchemy, which is an ORM (object-relational mapper: it converts, say, a row in an Author table into an Author class object in Python). While it's important to know SQL and database principles, you don't need to use SQL statements directly when using SQLAlchemy.
Each of these areas is huge, like the ocean. You can dip your toes into each, but if you wade in too deep you'll want to know how to swim. I have fist-sized books on each (that I've not finished) to give you a rough idea of what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
Learn about relational data
Learn database design
Learn SQL
Pandas (highly recommended):
Learn to read and write data (to Excel / a database)
Learn to merge, join, concatenate and update a DataFrame

Temp store full when iterating over a large number of sqlite3 records

I have a large SQLite DB where I am joining a 3.5M-row table onto itself. I use SQLite since it is the serialization format of my Python 3 application, and the flat-file format is important to my workflow. When iterating over the rows of this join (around 55M rows) using:
cursor.execute('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
               'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    # do stuff with row
EXPLAIN QUERY PLAN gives the following:
0|0|0|SCAN TABLE proteins AS p USING INDEX pid_index (~1000000 rows)
0|1|1|SEARCH TABLE proteins AS pp USING INDEX pname_index (pname=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
sqlite3 errors with "database or disk is full" after, say, 1,000,000 rows, which seems to indicate a full SQLite on-disk temp store. Since I have enough RAM on my current box, this can be solved by setting the temp store to in-memory, but that is suboptimal because then all the RAM seems to get used up, and I tend to run 4 or so of these processes in parallel. My (probably incorrect) assumption was that the iterator was a generator and would not put a large load on memory, unlike e.g. fetchall, which loads all rows. However, I now run out of disk space (on a small SSD scratch disk), so I assume that SQLite needs to store the results somewhere.
A way around this may be to run chunks of SELECT ... LIMIT x OFFSET y queries, but they get slower each time a bigger OFFSET is used. Is there any other way to run this? What is stored in these temporary files? They seem to grow the further I iterate.
0|0|0|USE TEMP B-TREE FOR DISTINCT
Here's what's using the disk.
In order to support DISTINCT, SQLite has to keep track of which rows have already appeared in the query. For a large number of results, this set can grow huge, so to save RAM, SQLite temporarily stores the distinct set on disk.
Removing the DISTINCT clause is an easy way to avoid the issue, but it changes the meaning of the query: you can now get duplicate rows. If you don't mind that, or you have unique indexes or some other way of ensuring that you never get duplicates, then that won't matter.
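If you do need to keep the DISTINCT, here is a sketch of the two knobs that control where that temporary B-tree lives (the paths and database filename are placeholders):

import os
import sqlite3

# point SQLite's temp files at a disk with more space; must be set
# before the connection that runs the query is opened
os.environ['SQLITE_TMPDIR'] = '/path/to/big/scratch/disk'

conn = sqlite3.connect('proteins.db')

# alternative: keep the temp structures in RAM instead (this is the option
# that eats memory when several of these processes run in parallel)
# conn.execute('PRAGMA temp_store = MEMORY;')

cursor = conn.execute(
    'SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
    'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    pass  # do stuff with row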
What you are trying to do with SQLite3 is a very bad idea; let me try to explain why.
You have the raw data on disk, where it fits and is readable.
You generate a result inside SQLite3 that expands greatly.
You then try to transfer this very large dataset through an SQL connector.
Relational databases in general are not made for this kind of operation, and SQLite3 is no exception. Relational databases were made for small, quick queries that live for a fraction of a second and return a couple of rows.
You would be better off using another tool.
Reading the whole dataset into Python using pandas, for instance, is my recommended solution; using itertools is also a good idea.
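Something along these lines, for instance (a sketch, assuming the 3.5M-row table fits in RAM once as a dataframe; the filename is a placeholder):

import sqlite3
import pandas as pd

conn = sqlite3.connect('proteins.db')

# pull the raw table once, then do the self-join and de-duplication
# in pandas instead of inside SQLite
proteins = pd.read_sql_query('SELECT pid, pname FROM proteins', conn)

joined = (proteins.merge(proteins, on='pname', suffixes=('', '_pp'))
                  .drop_duplicates()
                  .sort_values('pid'))

for row in joined.itertuples(index=False):
    pass  # do stuff with row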

Reading a large table with millions of rows from Oracle and writing to HDF5

I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using PyTables, with certain columns indexed. I will be reading subsets of this data into a pandas DataFrame and performing computations.
I have attempted the following:
Download the table, using a utility, into a CSV file; read the CSV file chunk by chunk using pandas and append to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
However, now when I am trying to download data directly from Oracle DB and post it to HDF5 file via pandas.HDFStore, I run into some problems.
pandas.io.sql.read_frame does not support chunked reading, and I don't have enough RAM to download the entire dataset into memory first.
If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype);
however, the created DataFrame always infers the dtypes rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch, e.g. a float64 may be interpreted as int64 in one chunk.
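One workaround I am considering is forcing the dtypes per chunk with astype before appending; a rough sketch (the column names, chunk size and file names here are illustrative, and `cursor` is the open Oracle cursor):

import pandas as pd

my_dtype = {'a': 'float64', 'b': 'int64', 'c': 'object'}  # example dtypes
cols = list(my_dtype)

store = pd.HDFStore('output.h5')
while True:
    rows = cursor.fetchmany(50000)
    if not rows:
        break
    # build the chunk, then force the dtypes explicitly so every chunk
    # appended to the store has an identical schema
    chunk = pd.DataFrame.from_records(rows, columns=cols).astype(my_dtype)
    store.append('mytable', chunk, data_columns=True)
store.close()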
I would appreciate it if you could offer your thoughts and point me in the right direction.
Well, the only practical solution for now is to use PyTables directly, since it's designed for out-of-memory operation... It's a bit tedious but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
Another approach, using Pandas, is here:
"Large data" work flows using pandas
Okay, so I don't have much experience with Oracle databases, but here are some thoughts:
Your access times for particular records from Oracle are slow because of the lack of indexing and the fact that you want the data in timestamp order.
Firstly, can you not enable indexing on the database?
If you can't manipulate the database, can you presumably request a found set that only includes the ordered unique IDs for each row?
You could potentially store this data as a single array of unique IDs, which should fit into memory. If you allow 4k for every unique key (a conservative estimate that includes overhead etc.), and you don't keep the timestamps, so it's just an array of integers, it might use up about 1.1GB of RAM for 3 million records. That's not a whole heap, and presumably you only want a small window of active data, or perhaps you are processing row by row?
Make a generator function to do all of this. That way, once you complete the iteration it frees up the memory without you having to del anything, and it also makes your code easier to follow and avoids bloating the important logic of your calculation loop.
If you can't store it all in memory, or this doesn't work for some other reason, then the best thing you can do is work out how much you can store in memory. You can potentially split the job into multiple requests and use multithreading to send a request once the previous one has finished, while you process the data into your new file. It shouldn't use up memory until you ask for the data to be returned. Try to work out whether the delay is in the request being fulfilled or in the data being downloaded.
From the sounds of it, you might be abstracting the database away and letting pandas make the requests. It might be worth looking at how it limits the results. You should be able to request all the data but only load the results one row at a time from the database server.

Insert performance with Cassandra

Sorry in advance for my English.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a Cassandra database locally, on one node. Each row has 10 columns and I insert them into only one column family.
With one thread, that operation took around 3 minutes. But I would like to do the same with 2 million rows while keeping a good time, so I tried with 2 threads to insert 2 million rows, expecting a similar result of around 3-4 minutes. But I got a result of about 7 minutes... twice the first result. As I have read on various forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data on a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have separated the commit log directory and the data directory onto different disks
- I use batch inserts in each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the Python GIL, but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 minutes is about 5,500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this rate, provided that you use multiple clients, probably inserting small batches of rows or individual rows.
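A rough sketch of "multiple clients in multiple processes, small batches" with pycassa (the keyspace, column family, batch size and row/column names are placeholders):

from multiprocessing import Process

import pycassa

def insert_range(start, stop):
    # one connection pool and client per process, so there is no GIL contention
    pool = pycassa.ConnectionPool('MyKeyspace')
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
    b = cf.batch(queue_size=100)  # small batches, flushed automatically as the queue fills
    for i in range(start, stop):
        b.insert('row%d' % i, {'col%d' % c: 'value' for c in range(10)})
    b.send()  # flush whatever is left in the queue

if __name__ == '__main__':
    procs = [Process(target=insert_range, args=(i * 500000, (i + 1) * 500000))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()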
You might consider Redis; its single-node throughput is supposed to be faster. It's different from Cassandra, though, so whether it's an appropriate option depends on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
