Python real time parallel / distributed data processing - python

I have a Python application that processes in-memory data. It responds in under 1 second by querying ~1 million records and then aggregating the result set.
What would be the best Python framework(s) to make this application more scalable?
Here are more details:
Data is loaded from a single table on disk into memory as numpy arrays, with custom indexes built using dictionaries.
The application starts breaching the 1 second limit when the number of records grows beyond 5 million. The search part (locating the indexes) takes only 100 ms. Most of the time (900 to 2000 ms) is spent just summing up the result set.
I can also see that the CPU cores and RAM are not used to their full capacity: each core is used only up to 20%, and plenty of memory is free.
I just read a long list of Python frameworks for distributed computing. I am looking for specific solutions for real-time responses:
1. First, by making better use of the available CPU and RAM on a single machine through parallel processing, to stay within the 1 second response time.
2. Later, by extending beyond a single machine to support even ~100 million records. The data is in a single table/file that can be horizontally partitioned across many machines, and these machines can work independently on their own data.
Suggestions based on what you have "seen" working in your past experience are greatly appreciated.

Related

Pyspark - df.cache().count() taking forever to run

I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I try running the code, the cache count part is taking forever to run. My data size is relatively small (2.7GB, 15 mil rows), but after 28 min of running, I decided to kill the job. For comparison, when I use pandas.read_sql() method to read the data, it took only 6 min 43 seconds.
The machine I'm running the code on is pretty powerful, (20 vCPU, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.
When you read with pandas, it can use as much of the machine's available memory as it needs (up to all 160 GB you mentioned, which is far larger than the ~3 GB of data).
Spark is different. When you start a Spark session, you typically have to specify upfront how much memory each executor (and the driver, and the application manager if applicable) may use. If you don't specify it, the default is 1 GB according to the latest Spark documentation. So the first thing to do is give more memory to your executors and driver.
Second, reading from JDBC with Spark is tricky, because the speed depends on the number of tasks that read in parallel. That number depends on how many partitions the DataFrame read from the JDBC connection has, which in turn depends on your table, your query, columns, conditions, etc. One way to force more partitions, and therefore more parallel tasks, is via these options: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the maximum number of partitions (hence the maximum number of parallel tasks and concurrent JDBC connections)
partitionColumn is the column Spark uses to split the read into ranges; it must be a numeric, date, or timestamp column
lowerBound is the low end of the partitionColumn range used to compute the partition stride
upperBound is the high end of the partitionColumn range used to compute the partition stride
Note that lowerBound and upperBound only decide how the ranges are cut; they do not filter rows, so values outside the bounds still end up in the first or last partition.
You can read more here https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want a reasonable number of partitions (defined by numPartitions), each covering a roughly equal share of the data (determined by partitionColumn, lowerBound and upperBound).
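As a minimal sketch of the partitioned read described above, the helper below passes the four options through PySpark's DataFrameReader.jdbc. The helper name read_jdbc_partitioned, the column name "id", and the bound values in the usage comment are assumptions for illustration; they need to be replaced with a column and range that match the actual table.

```python
def read_jdbc_partitioned(spark, url, table, properties,
                          partition_column, lower_bound, upper_bound,
                          num_partitions):
    """Read `table` over JDBC, split into `num_partitions` range queries.

    `spark` is an existing SparkSession. The bounds only set the stride of
    the ranges on `partition_column`; they do not filter any rows.
    """
    return spark.read.jdbc(
        url=url,
        table=table,
        column=partition_column,      # Spark option: partitionColumn
        lowerBound=lower_bound,
        upperBound=upper_bound,
        numPartitions=num_partitions,
        properties=properties,
    )


# Hypothetical usage with the names from the question; "id" and the bounds
# are assumed values, not taken from the real table:
#
# spark_df = read_jdbc_partitioned(
#     spark, jdbcUrl, pushdown_query, connectionProperties,
#     partition_column="id", lower_bound=1, upper_bound=15_000_000,
#     num_partitions=20,
# )
# spark_df.cache().count()
```

If the values of the chosen column are heavily skewed, the partitions will be uneven and a few tasks will still do most of the work, so an evenly distributed key is the safer choice.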

How to tune the parameters for a better insert performance in Postgresql?

The dataset consists of two folders, each containing 180 CSV files. My Python script reads one CSV file from each folder per iteration, for 180 iterations.
After reading the data and some data processing, the data from the two CSV files is merged into one list of the form [(1,1), (1,2), (1,3), ...]. There would be about 3,000,000,000 tuples in the list.
I first revised some parameters of postgresql.conf (listed below) and tried to insert the whole list into the DB at once with executemany. It took about 30,000 seconds, which means getting through all the files would take me two months.
Then I tried splitting the list into 30 sub-lists and inserting them one after another. Each sub-list took about 230 seconds, which is much faster, but the whole job would still cost 15 days.
I will probably use multiprocessing since I have more RAM and CPUs available (one process uses 35 GB of memory, so I guess I will try three processes). But I would like to ask: can I somehow optimize the DB parameters for better insert performance in this step? The PostgreSQL documentation does not say much about parameter tuning, except that setting shared_buffers to 25% of RAM would enhance performance.
Thanks in advance!
My current parameters that differ from the defaults:
shared_buffers = 10GB
temp_buffers = 1GB
max_wal_size = 1GB
effective_cache_size = 20GB
work_mem = 2GB
maintenance_work_mem = 4GB
ENV: Red Hat, RAM: 150 GB, number of cores: 40

How does dask work for larger than memory datasets

Would anyone be able to tell me, in simple terms, how dask works for larger-than-memory datasets? For example, I have a dataset which is 6GB, and 4GB of RAM with 2 cores. How would dask go about loading the data and doing a simple calculation such as the sum of a column?
Does dask automatically check the size of the memory and chunk the dataset into smaller-than-memory pieces, and then, once requested to compute, bring the data into memory chunk by chunk and do the computation using each of the available cores? Am I right about this?
Thanks
Michael
By "dataset" you are apparently referring to a dataframe. Let's consider two file formats from which you may be loading: CSV and parquet.
For CSVs, there is no inherent chunking mechanism in the file, so you, the user, can choose the bytes-per-chunk appropriate for your application using dd.read_csv(path, blocksize=..), or allow Dask to try to make a decent guess; "100MB" may be a fine size to try.
For parquet, the format itself has internal chunking of the data, and Dask will make use of this pattern in loading the data.
In both cases, each worker will load one chunk at a time, and calculate the column sum you have asked for. Then, the loaded data will be discarded to make space for the next one, only keeping the results of the sum in memory (a single number for each partition). If you have two workers, two partitions will be in memory and processed at the same time. Finally, all the sums are added together.
Thus, each partition should be small enough to fit comfortably in memory, but large enough that the time it takes to load and process it is much longer than the overhead of scheduling a task to run on a worker (less than 1 ms).
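The load, sum, discard, and combine pattern above can be illustrated with a small standard-library sketch. This is not how Dask is implemented; it only mimics the idea sequentially, and the function names and the rows_per_chunk parameter are made up for the illustration. With Dask itself the equivalent would be roughly dd.read_csv(path, blocksize="100MB")[column].sum().compute(), after import dask.dataframe as dd.

```python
import csv


def iter_chunks(path, rows_per_chunk):
    """Yield lists of row dicts, holding at most rows_per_chunk rows at once."""
    with open(path, newline="") as handle:
        chunk = []
        for row in csv.DictReader(handle):
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                yield chunk
                chunk = []
        if chunk:  # last, possibly shorter, chunk
            yield chunk


def chunked_column_sum(path, column, rows_per_chunk=100_000):
    """Sum one CSV column without ever holding the whole file in memory."""
    partial_sums = []
    for chunk in iter_chunks(path, rows_per_chunk):
        # Only one number per chunk is kept; the rows themselves are
        # dropped when the next chunk replaces them.
        partial_sums.append(sum(float(row[column]) for row in chunk))
    # Final step: add the per-chunk results together.
    return sum(partial_sums)
```

The sketch processes one chunk at a time; with two workers, Dask would do the same thing for two partitions concurrently.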

Multiprocessing with large no of files

I am trying to solve a problem. I would appreciate your valuable input on this.
Problem statement:
I am trying to read a lot of files (on the order of 10**6) in the same base directory. Each file has a name that matches the pattern YYYY-mm-dd-hh, and the content of each file is as follows:
mm1, vv1
mm2, vv2
mm3, vv3
.
.
.
where mm is the minute of the day and vv is some numeric value with respect to that minute. I need to find, given a start-time (ex. 2010-09-22-00) and an end-time (ex. 2017-09-21-23), the average of all vv's.
So basically user will provide me with a start_date and end_date, and I will have to get the average of all the files in between the given date range. So my function would be something like this:
get_average(start_time, end_time, file_root_directory):
Now, what I want to understand is how I can use multiprocessing to average out the smaller chunks, and then build upon that to get the final value.
NOTE: I am not looking for a linear solution. Please advise me on how to break the problem into smaller chunks and then sum them up to find the average.
I did try using the multiprocessing module in Python by creating a pool of 4 processes, but I am not able to figure out how to retain the values in memory and add the results together for all the chunks.
Your process is going to be I/O bound.
Multiprocessing may not be very useful, and may even be counterproductive.
Moreover, your storage system, based on an enormous number of small files, is not the best. You should look at a time series database such as InfluxDB.
Given that the actual processing is trivial (a sum and count for each file), using multiple processes or threads is not going to gain much. This is because 90+% of the effort is opening each file and transferring its content into memory.
However, the most obvious partitioning would be based on some per-data-file scheme. So if the search range is (your example) 2010-09-22-00 through 2017-09-21-23, then there are seven years with (maybe?) one file per hour for a total of 61,368 files (including two leap days).
61 thousand processes do not run very effectively on one system, at least so far. (Probably it will be a reasonable capability some years from now.) But for a real (non-supercomputing) system, partition the problem into a few segments, perhaps two or three times the number of CPUs available to do the work. This desktop computer has four cores, so I would first try 12 processes where each independently computes the sum and count (number of samples present, if variable) of 1/12 of the files.
Interprocess communication can be eliminated by using threads. Or, for a process-oriented approach, setting up a pipe to each process to receive the results is a straightforward affair.
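The partitioning described above could be sketched roughly as follows. This is one possible layout rather than a definitive implementation: it assumes one file per hour named YYYY-mm-dd-hh, lines of the form "mm, vv", and that missing files can simply be skipped. The helpers hourly_file_names and sum_and_count are names invented for this sketch. It defaults to threads, as suggested for I/O-bound work; passing concurrent.futures.ProcessPoolExecutor as executor_cls gives the process-based variant (called from under an if __name__ == "__main__": guard).

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from pathlib import Path

TIME_FORMAT = "%Y-%m-%d-%H"


def hourly_file_names(start_time, end_time):
    """Yield one file name per hour from start_time to end_time inclusive."""
    current = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    while current <= end:
        yield current.strftime(TIME_FORMAT)
        current += timedelta(hours=1)


def sum_and_count(paths):
    """Return (sum of vv, number of vv) over one chunk of files."""
    total, count = 0.0, 0
    for path in paths:
        try:
            with open(path) as handle:
                for line in handle:
                    parts = line.split(",")
                    if len(parts) == 2:  # skip blank or malformed lines
                        total += float(parts[1])
                        count += 1
        except FileNotFoundError:
            continue  # assumption: hours without a file are ignored
    return total, count


def get_average(start_time, end_time, file_root_directory,
                workers=12, executor_cls=ThreadPoolExecutor):
    root = Path(file_root_directory)
    paths = [root / name for name in hourly_file_names(start_time, end_time)]
    # Deal the files out into `workers` roughly equal chunks.
    chunks = [paths[i::workers] for i in range(workers)]
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(sum_and_count, chunks))
    # Combine partial (sum, count) pairs; averaging the per-chunk averages
    # would be wrong when chunks hold different numbers of samples.
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Each worker returns only two numbers, so nothing large has to be kept in memory or shared between workers; the parent adds up the pairs and divides once at the end.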

Python vs Scala (for Spark jobs)

I am pretty new to Spark, and am currently exploring it by playing with pyspark and spark-shell.
So here is the situation: I run the same Spark job with pyspark and with spark-shell.
This is from pyspark:
textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()
And this one from spark-shell:
val textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()
I tried both of them several times. The first one (Python) takes 30-35 seconds to complete, while the second one (Scala) takes about 15 seconds. I am curious about what may cause this difference in performance. Is it because of the choice of language, or does spark-shell do something in the background that pyspark doesn't?
UPDATE
So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as master.
I observed that when using pyspark, tasks are shared equally among executors. However, when using spark-shell, tasks are not shared equally: more powerful machines get more tasks while weaker machines get fewer tasks.
With spark-shell the job finishes in 25 minutes, and with pyspark it takes around 55 minutes. How can I make Spark Standalone assign tasks with pyspark the way it assigns them with spark-shell?
Using Python has some overhead, but its significance depends on what you're doing.
Recent reports indicate the overhead isn't very large (specifically for the new DataFrame API), and some of the overhead you encounter is a constant per-job cost, which is almost irrelevant for large jobs.
You should do a sample benchmark with a larger data set, and see whether the overhead is a constant addition or is proportional to the data size.
Another potential bottleneck is operations that apply a Python function to each element (map, etc.). If these operations are relevant for you, you should test them too.
