Sorry for my English in advance.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a Cassandra database running locally on one node. Each row has 10 columns, and I insert them into only one column family.
With one thread, that operation took around 3 minutes. But I would like to do the same with 2 million rows while keeping a comparable time. So I tried inserting 2 million rows with 2 threads, expecting a similar result of around 3-4 minutes, but I got about 7 minutes... twice the first result. From what I've read on different forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data into a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have separated the commitlog directory and the data directory onto different disks
- I use batch insert for each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the Python GIL, but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 mins is about 5500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount provided that you use multiple clients, probably inserting small batches of rows, or individual rows.
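For illustration, a rough sketch of the multi-process approach with pycassa; the keyspace, column family and row/column names are placeholders, and the batch size is just a starting point to experiment with:
import multiprocessing
import pycassa

def insert_range(start, end):
    # each worker process opens its own connection pool (pools should not be shared across forks)
    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')
    b = cf.batch(queue_size=100)   # queue up to 100 rows per batch mutation
    for i in range(start, end):
        b.insert('row%d' % i, dict(('col%d' % c, 'value') for c in range(10)))
    b.send()                       # flush whatever is left in the queue
    pool.dispose()

if __name__ == '__main__':
    total, workers = 2000000, 4
    step = total // workers
    procs = [multiprocessing.Process(target=insert_range, args=(w * step, (w + 1) * step))
             for w in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()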
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
Related
I'm working on a pathfinding project that uses topographic data of huge areas.
In order to reduce the huge memory load, my plan is to pre-process the map data by creating nodes that are saved in a Postgres DB at start-up, and then accessed as needed by the algorithm.
I've created 3 Docker containers for that: the Postgres DB, Adminer, and my Python app.
It works as expected with a small amount of data, so the communication between the containers, or the application itself, isn't the problem.
The way it works is that you give it a 2D array; it takes the first row, converts each element into a node and saves it in the DB using psycopg2.extras.execute_values, before going to the second row, then the third...
Once all nodes are registered, it updates each of them by searching for its neighbors and adding their ids in the right column. That way the pre-processing takes longer, but I have easier access when running the algorithm.
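(For illustration, the per-row insert with execute_values looks something like the sketch below; the table and column names are simplified placeholders, not the exact schema.)
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(host="postgres", dbname="pathfinding", user="app", password="app")
cur = conn.cursor()

def insert_row_of_nodes(row_index, row):
    # one tuple per map cell; invalid cells are assumed to be filtered out beforehand
    values = [(row_index, col_index, value) for col_index, value in enumerate(row)]
    execute_values(
        cur,
        "INSERT INTO nodes (grid_row, grid_col, elevation) VALUES %s",
        values,
        page_size=1000,
    )
    conn.commit()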
However, I think the DB has trouble processing the data past a certain point. The map I gave it comes from a .tif file of 9600x14400 pixels, and even after ignoring useless/invalid data, that amounts to more than 10 million nodes.
Basically, it worked quite slowly but okay, until around 90% of the node-creation process, when the data stopped being processed. Both the Python and Postgres containers were still running and responsive, but no more nodes were being created, and the neighbor-linking part of the pre-processing didn't start either.
There were also no error messages on either side.
I've read that the row limit of a Postgres table is absurdly high, but also that a table becomes really slow once it holds a lot of rows. Could it be that it didn't crash or freeze, but just takes an insane amount of time to complete the remaining node-creation requests?
Would reducing the batch size even more help in that regard?
Or would maybe splitting the table into multiple smaller ones be better?
The queries and the psycopg functions I used were not optimized for the mass inserts and updates I was doing.
The changes I've made were:
Reducing the batch size from 14k to 1k
Making larger SELECT queries instead of many smaller ones
Creating indexes on important columns
Changing a plain UPDATE query to an UPDATE ... FROM, run with execute_values instead of cursor.execute (sketched below)
It made the execution time go from around an estimated 5.5 days to around 8 hours.
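For reference, the UPDATE ... FROM change looks roughly like the sketch below; the table and column names (nodes, id, neighbor_ids) are stand-ins, not the exact schema:
from psycopg2.extras import execute_values

def link_neighbors(cur, pairs):
    # pairs: list of (node_id, neighbor_ids) tuples computed in Python
    execute_values(
        cur,
        """
        UPDATE nodes AS n
        SET neighbor_ids = v.neighbor_ids
        FROM (VALUES %s) AS v (id, neighbor_ids)
        WHERE n.id = v.id
        """,
        pairs,
        page_size=1000,
    )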
I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I run the code, the cache-and-count part takes forever. My data size is relatively small (2.7 GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, reading the same data with pandas.read_sql() took only 6 minutes 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count.
Any help or suggestions are appreciated.
When you use pandas to read, it will use as much of the machine's available memory as it needs (you mentioned 160 GB, which is far larger than the ~3 GB of data itself).
However, it's not the same with Spark. When you start your Spark session, you typically have to state upfront how much memory each executor (and the driver, and the application manager if applicable) may use, and if you don't specify it, the default is 1 GB according to the latest Spark documentation. So the first thing you want to do is give more memory to your executors and driver.
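For example, something along these lines when building the session (the sizes below are placeholders to tune, not recommendations):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-read")
         .config("spark.driver.memory", "16g")     # must be set before the first session/JVM starts
         .config("spark.executor.memory", "16g")   # in local mode executors live inside the driver JVM
         .getOrCreate())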
Second, reading from JDBC with Spark is tricky, because the speed depends on the number of executors (and tasks), those numbers depend on how many partitions the RDD read from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to force a different behavior (more partitions, more tasks, more executors, ...) is via these configurations: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (hence the number of executors will be used)
partitionColumn is an integer-type column that Spark uses to split the read across partitions
lowerBound is the min value of partitionColumn that you want to read
upperBound is the max value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want a reasonable number of executors (defined by numPartitions), each processing an equally distributed chunk of the data (defined by partitionColumn, lowerBound and upperBound).
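Applied to the snippet above, that could look like the following sketch; the partition column "id" and the bounds are assumptions, so pick a reasonably evenly distributed integer column from your own table:
spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",              # partitionColumn: an integer column, assumed here
    lowerBound=1,             # smallest value of that column you want to read
    upperBound=15000000,      # largest value of that column (~15M rows in the question)
    numPartitions=20,         # e.g. roughly one partition per vCPU
    properties=connectionProperties,
)
spark_df.cache().count()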
Working in PostgreSQL, I have a cartesian join producing ~4 million rows.
The join takes ~5sec and the write back to the DB takes ~1min 45sec.
The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (following an answer here: cartesian product in pandas) takes consistently under 3 seconds; impressive.
Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected because I had to stop them after >1 hr).
While I was not expecting to reproduce the "SQL only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 minutes would not be unreasonable).
Slower methods include:
36 min - SQLAlchemy's table.insert (from 'test_sqlalchemy_core' here: https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using SQLAlchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a CSV first (in memory via io.StringIO); that step alone takes 2 minutes.
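For reference, the copy_from approach is essentially the sketch below (the table name is a placeholder, and it assumes the data contains no embedded commas or newlines):
import io

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)   # dump the dataframe to in-memory CSV; this step alone takes ~2 min
buf.seek(0)
with conn.cursor() as cur:
    cur.copy_from(buf, 'target_table', sep=',', null='')
conn.commit()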
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now use a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it streams CSV data to COPY ... FROM STDIN using StringIO).
I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
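For completeness, the callable is essentially the recipe from the linked pandas docs; the table and engine names below are placeholders:
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # write the chunk to an in-memory CSV, then stream it to Postgres via COPY ... FROM STDIN
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(name, columns), buf)

df.to_sql('target_table', engine, index=False, method=psql_insert_copy)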
Answering Q1 myself:
It seems the issue had more to do with PostgreSQL (or rather with databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in postgresql) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.
3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles
I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect Python to the database via SQLAlchemy (postgresql+psycopg2). I am running it all in the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression for each pair of stocks, i.e. ABC on XYZ and also XYZ on ABC, across the n=100 stocks, which gives n(n-1)/2 pairs (two regressions each).
-> I am thinking of a function that takes a pair of stocks, runs the two regressions, compares the results, and picks one based on some criteria.
My question: Is there an efficient way to generate and work through all these pairwise combinations?
2) Rolling windows: To avoid an overload of data, I thought to only load the window under investigation, i.e. 30 days, and then roll it forward one day at a time, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
Meaning I always drop the first day and add another row at the end of my dataframe, so I have two steps: drop the first day, and read in the next row from my database.
My question: Is this a sensible way to do it, or does Python have something better up its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first row and adding another one, I keep the first 30 days and add another 30 days, then run my regression. The problem here is that at some point I would be using all the data, which will probably be too big for memory.
My question: What would be a workaround here?
4) As I am running my analysis in the cloud (with a few more cores than my own PC), I could in fact use multithreading, sending "batch" jobs and letting Python do things in parallel. I thought of splitting my dataset into 4x 25 stocks and letting them run in parallel (a vertical split), or should I rather split horizontally?
Additionally, I am using Jupyter, and I am wondering how best to approach this; usually I have a shell script calling my_program.py. Is it the same here?
Let me try to give answers categorically and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and you are trying to perform pairwise linear regression amongst them. The good news about this: it's highly parallelizable. All you need to do is generate the unique combinations of all possible pairings, perform your regressions, and then keep only those models which fit your criteria.
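A minimal sketch of that pairing step, assuming stocks is the Python list of ~100 ticker names:
from itertools import combinations

pairs = list(combinations(stocks, 2))   # n*(n-1)/2 unique unordered pairs
# each pair can then be handed to a worker that runs both regressions
# (a on b, and b on a) and keeps whichever model satisfies your criteria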
Now, as the stocks are your variables, I assume the rows are their prices or similar values, but definitely some kind of time-series data. If my assumption is correct, then there is a problem with the rolling-window approach. In creating these rolling windows, what you are implicitly doing is a data-sampling method called 'bootstrapping', which uses random but repeated sampling. But because you are just rolling your data, you are not using random sampling, which might create problems for your regression results. At best the model may simply be overtrained; at worst, I cannot imagine. Hence, drop this approach. Plus, if it's time-series data, then the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processing capacity - I think this is an excellent scenario for Spark. It is built exactly for this purpose and has excellent Python support. Millions of data points are no big deal for Spark, and you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you configurability and expandability without headache. I don't know why people like to use Jupyter even for batch tasks like these, but if you are hell-bent on using it, a PySpark kernel is also supported by Jupyter. A vertical split would probably be the right approach here.
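For instance, a rough sketch of distributing those pairs with PySpark; run_pair_regression and meets_criteria are hypothetical functions you would supply:
# distribute the pair list across the cluster and fit the regressions in parallel
pair_rdd = spark.sparkContext.parallelize(pairs, numSlices=100)
results = (pair_rdd
           .map(run_pair_regression)    # e.g. returns (pair, chosen_model, fit_statistics)
           .filter(meets_criteria)
           .collect())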
Hope these answer your questions.
I am writing a Django application that will have entries entered by users of the site. Now suppose that everything goes well, and I get the expected number of visitors (unlikely, but I'm planning for the future). This would result in hundreds of millions of entries in a single PostgreSQL database.
As iterating through such a large number of entries and checking their values is not a good idea, I am considering ways of grouping entries together.
Is grouping entries into sets of (let's say) 100 a better idea for storing this many entries? Or is there a better way I could optimize this?
Store one at a time until you absolutely cannot anymore, then design something else around your specific problem.
SQL is a declarative language, meaning "give me all records matching X" doesn't tell the DB server how to do it. Consequently, you have a lot of ways to help the DB server answer such queries quickly even when you have hundreds of millions of records. Additionally, RDBMSs have been optimized for this problem over many years, so up to a certain point you will not beat a system like PostgreSQL.
So as they say, premature optimization is the root of all evil.
So let's look at two ways PostgreSQL might go through a table to give you the results.
The first is a sequential scan, where it iterates over a series of pages, scans each page for the values and returns the records to you. This works better than any other method for very small tables. It is slow on large tables. Complexity is O(n) where n is the size of the table, for any number of records.
So a second approach might be an index scan. Here PostgreSQL traverses a series of pages in a b-tree index to find the records. Complexity is O(log(n)) to find each record.
Internally, PostgreSQL already stores rows in fixed-size batches, called pages, so it has solved this problem for you. If you try to do the same, you end up with batches of records inside batches of records, which is usually a recipe for bad things.
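To make that concrete, a hypothetical sketch (the table and column names are made up): adding an index and using EXPLAIN to confirm that PostgreSQL switches from a sequential scan to an index scan for a selective lookup.
import psycopg2

conn = psycopg2.connect("dbname=mysite")   # connection details are placeholders
with conn.cursor() as cur:
    cur.execute("CREATE INDEX IF NOT EXISTS entries_user_id_idx ON entries (user_id)")
    conn.commit()
    cur.execute("EXPLAIN SELECT * FROM entries WHERE user_id = %s", (42,))
    for (line,) in cur.fetchall():
        print(line)   # expect an "Index Scan using entries_user_id_idx ..." plan node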