I have this code, which splits the dataframe into chunks of 10,000 rows and writes each chunk to a file.
I tried a z1d instance with 24 CPUs and 192 GB of RAM, but even that didn't speed things up much: 1 million rows took 9 minutes.
This is the code:
from pyspark.sql.functions import monotonically_increasing_id

total = df2.count()
offset = 10000
counter = int(total / offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
    lower = i * offset
    upper = lower + offset
    filter = f"idx > {lower} and idx < {upper}"
    ddf = idxDf.filter(filter)
    ddf2 = ddf.drop("idx")
    ddf2.write.option("header", "false").option("delimiter", " ").option("compression", "gzip").csv(outputpath)
Is there any way I can make it faster? Currently I am using a single master node only. I have 100 million rows and want to know how fast I can do this with EMR.
It looks like my normal Python code is able to do the same thing in about the same number of minutes.
A few problems with what you’re trying to do here:
Stop trying to write pyspark code as if it’s normal python code. It isn’t. Read up on exactly how spark works first and foremost. You’ll have more success if you change the way you program when you use spark, not try to get spark to do what you want in the way you want.
Avoid for loops with Spark wherever possible. A for loop only runs in native Python on the driver, so you're not utilising Spark while it executes, which means one CPU on one Spark node runs the code.
Python is, by default, single threaded. Adding more CPUs will do literally nothing for the performance of native Python code (i.e. your for loop) unless you rewrite it for either (a) multi-threaded processing or (b) distributed processing (i.e. Spark).
You only have one master node (and, I assume, zero slave nodes). That is going to take ages to process a 192 GB file. The point of Spark is to distribute the workload onto many slave nodes. There are some quite technical ways to determine the optimal number of slave nodes for your problem, but as a starting point try something like 50 or 100 slaves; that should give you a decent performance uplift (each node can comfortably process roughly 1-4 GB of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember running a 100 GB file through some heavy lifting took a whole day on 16 nodes; upping the machine spec and the number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df = df.repartition(n_files)
df.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje-hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it’s a good idea to do df.repartition(x) on the resulting dataframe after you load it, where x = number of slave nodes.
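Putting those pieces together, here is a minimal sketch of the distributed-write approach; the names input_path and outputpath and the partition count of 50 are assumptions, so adjust them to your data and cluster:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-write").getOrCreate()

# Load the source data and spread it across the cluster up front.
df = spark.read.csv(input_path, header=False)
df = df.repartition(50)   # roughly the number of slave nodes

# A single distributed write: Spark produces one gzipped part file per
# partition under outputpath, with no driver-side loop.
(df.write
   .option("header", "false")
   .option("delimiter", " ")
   .option("compression", "gzip")
   .csv(outputpath))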
I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I try running the code, the cache-and-count part takes forever to run. My data size is relatively small (2.7 GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, when I use the pandas.read_sql() method to read the data, it took only 6 minutes 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step to speed up the count statement.
Any help or suggestions are appreciated.
When you use pandas to read, it will use as much of the machine's available memory as it needs (potentially all 160 GB, as you mentioned, which is far larger than the ~3 GB of data itself).
However, it's not the same with Spark. When you start your Spark session, you typically have to state upfront how much memory per executor (and for the driver, and the application master if applicable) you want to use, and if you don't specify it, it defaults to 1 GB according to the latest Spark documentation. So the first thing you want to do is give more memory to your executors and driver.
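For example, a minimal sketch of raising those settings when building the session; the sizes here are illustrative assumptions, not recommendations for your machine:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-read")
         # Illustrative values; driver memory must be set before the JVM starts,
         # so set it here (or via spark-submit), not after the session exists.
         .config("spark.driver.memory", "32g")
         .config("spark.executor.memory", "16g")
         .config("spark.executor.cores", "4")
         .getOrCreate())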
Second, reading from JDBC with Spark is tricky, because the speed depends on the number of executors (and tasks), those numbers depend on how many partitions the RDD that reads from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to force a different behaviour (more partitions, more tasks, more executors, ...) is via these options: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence how many executors will be used in parallel)
partitionColumn is an integer type column that Spark would use to target partitioning
lowerBound is the min value of partitionColumn that you want to read
upperBound is the max value of partitionColumn that you want to read
You can read more here: https://stackoverflow.com/a/41085557/3441510. The basic idea is that you want a reasonable number of executors (defined by numPartitions), each processing an equally distributed chunk of the data (defined by partitionColumn, lowerBound and upperBound).
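A minimal sketch of a partitioned JDBC read; the column name id and the bounds are assumptions for your table, and note that in the PySpark jdbc() helper the partition column parameter is called column:
spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",             # assumed integer column to partition on
    lowerBound=1,            # assumed min value of that column
    upperBound=15000000,     # assumed max value of that column
    numPartitions=20,        # roughly one partition per vCPU
    properties=connectionProperties,
)
spark_df.cache().count()     # eager evaluation, now spread across 20 parallel reads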
I'm currently working on a larger Apache Beam pipeline with the Python API, which reads data from BigQuery and in the end writes it back to another BigQuery table.
One of the transforms needs to use a binary program to transform the data, and for that it needs to load a 23GB file with binary lookup data. So starting and running the program takes a lot of overhead (takes about 2 minutes to load/run each time) and RAM, and it wouldn't make sense to start it up for just a single record. Plus the 23GB file would need to be copied locally from Cloud Storage every time.
The workflow for the binary would be:
Copy 23GB file from cloud storage if it's not there already
Save records to a file
run the binary with call()
read the output of the binary and return it
The number of records the program can process at a time is essentially unlimited, so it would be nice to have a somewhat-distributed Beam transform where I could specify the number of records to process at once (say 100,000 at a time), but still have it distributed so it runs batches of 100,000 records on multiple nodes.
I don't see Beam supporting this behaviour out of the box. It might be possible to hack something together as a KeyedCombineFn operation that collects records based on some split criterion/key and then runs the binary in the merge_accumulators step over the accumulated records, but that seems very hackish to me.
Or is it possible to GroupByKey and process groups as batches? Does this guarantee that each group is processed at once, or can groups be split behind the scenes by Beam?
I also saw there's a GroupIntoBatches in the Java API, which sounds like what I'd need, but isn't available in the Python SDK as far as I can tell.
My two questions are: what's the best way (performance-wise) to achieve this use case in Apache Beam, and if there isn't a good solution, is there some other Google Cloud service that might be better suited and could be used like Beam --> Other Service --> Beam?
Groups cannot be split behind the scenes, so using a GroupByKey should work. In fact, this is a requirement since each individual element must be processed on a single machine and after a GroupByKey all values with a given key are part of the same element.
You will likely want to assign random keys. Keep in mind that if there are too many values with a given key it may also be difficult to pass all of those values to your program -- so you may also want to limit how many of the values you pass to the program at a time and/or adjust how you assign keys.
One trick for assigning random keys is to generate the random number in start bundle (say between 1 and 1000) and then in process element just increment it, wrapping back around to 1 once it passes 1000. This avoids generating a random number for every element, and still ensures a good distribution of keys.
You could create a PTransform for this logic (divide a PCollection<T> into PCollection<List<T>> chunks for processing), which would potentially be reusable in similar situations.
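A rough sketch of how the key-assignment trick and GroupByKey could look in the Python SDK; the class name AssignBatchKey, the stubbed run_binary_on_batch, the toy Create input, and the key range are illustrative assumptions, not part of Beam's API:
import random

import apache_beam as beam


class AssignBatchKey(beam.DoFn):
    """Assign keys in 1..num_keys, starting from a random key per bundle."""

    def __init__(self, num_keys=1000):
        self._num_keys = num_keys

    def start_bundle(self):
        # Pick a starting key once per bundle instead of per element.
        self._key = random.randint(1, self._num_keys)

    def process(self, element):
        # Increment and wrap so the key always stays within 1..num_keys.
        self._key = self._key % self._num_keys + 1
        yield (self._key, element)


def run_binary_on_batch(keyed_batch):
    key, records = keyed_batch
    # In the real pipeline this is where you would copy the 23GB lookup file
    # if needed, dump the records to a temp file, invoke the binary with
    # subprocess.call(), and parse its output. Stubbed here.
    return (key, len(list(records)))


with beam.Pipeline() as p:
    _ = (
        p
        | "Read" >> beam.Create([f"record-{i}" for i in range(10)])  # stand-in for the BigQuery read
        | "AssignKeys" >> beam.ParDo(AssignBatchKey(num_keys=3))
        | "GroupIntoChunks" >> beam.GroupByKey()
        | "RunBinary" >> beam.Map(run_binary_on_batch)
    )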
Background: In Hadoop Streaming, each reduce task writes to HDFS as it finishes, thus clearing the way for the Hadoop cluster to execute the next reduce.
I am having trouble mapping this paradigm to (Py)Spark.
As an example,
df = spark.read.load('path')
df.rdd.reduceByKey(my_func).toDF().write.save('output_path')
When I run this, the cluster collects all of the data in the dataframe before it writes anything to disk. At least this is what it looks like is happening as I watch the job progress.
My problem is that my data is much bigger than my cluster memory, so I run out of memory before any data is written. In Hadoop Streaming, we don't have this problem because the output data is streamed to the disk to make room for the subsequent batches of data.
I have considered something like this:
for i in range(100):
    (df.filter(df.loop_index == i)
       .rdd
       .reduceByKey(my_func)
       .toDF()
       .write.mode('append')
       .save('output_path'))
where I only process a subset of my data in each iteration. But this seems kludgy, mainly because I either have to persist df, which isn't possible because of memory constraints, or I have to re-read from the input HDFS source in each iteration.
One way to make the loop work is to partition the source folders by day or some other subset of the data. But for the sake of the question, let's assume that isn't possible.
Questions: How do I run a job like this in PySpark? Do I just have to have a much bigger cluster? If so, what are the common practices for sizing a cluster before processing the data?
It might help to repartition your data into a large number of partitions. The example below would be similar to your for loop, although you may want to try with fewer partitions first:
df = spark.read.load('path').repartition(100)
You should also review the number of executors you are currently using (--num-executors). Reducing this number should also reduce your memory footprint.
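A minimal sketch of the repartition-then-write pipeline, reusing the spark session, my_func, and paths from your example; the partition count of 200 is an assumption to tune:
df = spark.read.load('path').repartition(200)

(df.rdd
   .reduceByKey(my_func)
   .toDF()
   .write.mode('overwrite')
   .save('output_path'))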
I am trying to solve a problem. I would appreciate your valuable input on this.
Problem statement:
I am trying to read a lot of files (on the order of 10**6) in the same base directory. Each file has a name that matches the pattern YYYY-mm-dd-hh, and the content of the files is as follows
mm1, vv1
mm2, vv2
mm3, vv3
.
.
.
where mm is the minute of the day and vv is some numeric value with respect to that minute. I need to find, given a start time (e.g. 2010-09-22-00) and an end time (e.g. 2017-09-21-23), the average of all vv's.
So basically the user will provide me with a start_date and an end_date, and I will have to compute the average over all the files in the given date range. So my function would be something like this:
get_average(start_time, end_time, file_root_directory):
Now, what I want to understand is how I can use multiprocessing to average the smaller chunks, and then combine those partial results to get the final value.
NOTE: I am not looking for a linear solution. Please advise me on how to break the problem into smaller chunks and then sum them up to find the average.
I did try using the multiprocessing module in Python by creating a pool of 4 processes, but I am not able to figure out how to retain the values in memory and add the results together for all the chunks.
Your process is going to be I/O bound.
Multiprocessing may not be very useful, if not counterproductive.
Moreover, your storage layout, based on an enormous number of small files, is not ideal. You should look at a time series database such as InfluxDB.
Given that the actual processing is trivial (a sum and count for each file), using multiple processes or threads is not going to gain much. This is because 90+% of the effort is opening each file and transferring its content into memory.
However, the most obvious partitioning would be based on some per-data-file scheme. So if the search range is (your example) 2010-09-22-00 through 2017-09-21-23, then there are seven years with (maybe?) one file per hour for a total of 61,368 files (including two leap days).
61 thousand processes do not run very effectively on one system, at least so far. (Perhaps it will be a reasonable capability some years from now.) On a real (non-supercomputing) system, you would partition the problem into a few segments, perhaps two or three times the number of CPUs available to do the work. This desktop computer has four cores, so I would first try 12 processes, where each independently computes the sum and count (the number of samples present, if variable) of 1/12 of the files.
Interprocess communication can be eliminated by using threads. Or, for a process-oriented approach, setting up a pipe to each process to receive the results is straightforward.
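A minimal sketch of that partitioning with multiprocessing.Pool, assuming files named exactly YYYY-mm-dd-hh whose lines look like "mm, vv" as described above; the chunking scheme and n_chunks=12 are illustrative:
import os
from multiprocessing import Pool


def sum_and_count(file_paths):
    """Return (sum, count) of all vv values in the given files."""
    total, count = 0.0, 0
    for path in file_paths:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Each line looks like "mm, vv".
                _, vv = line.split(",")
                total += float(vv)
                count += 1
    return total, count


def get_average(start_time, end_time, file_root_directory, n_chunks=12):
    # Keep only files whose YYYY-mm-dd-hh name falls inside the range;
    # lexicographic comparison works for this naming scheme.
    names = sorted(n for n in os.listdir(file_root_directory)
                   if start_time <= n <= end_time)
    paths = [os.path.join(file_root_directory, n) for n in names]

    # Split the file list into n_chunks roughly equal slices.
    chunks = [paths[i::n_chunks] for i in range(n_chunks)]

    # Each worker returns a partial (sum, count); combine them at the end.
    with Pool(processes=n_chunks) as pool:
        partials = pool.map(sum_and_count, chunks)

    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")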
What I'm trying to do
I'm new to hadoop and I'm trying to perform MapReduce several times with a different number of mappers and reducers, and compare the execution time. The file size is about 1GB, and I'm not specifying the split size so it should be 64MB. I'm using a machine with 4 cores.
What I've done
The mapper and reducer are written in python. So, I'm using hadoop streaming. I specified the number of map tasks and reduce tasks by using '-D mapred.map.tasks=1 -D mapred.reduce.tasks=1'
Problem
Because I specified 1 map task and 1 reduce task, I expected to see just one map attempt, but I actually got 38 map attempts and 1 reduce task. I have read tutorials and SO questions similar to this problem, and some said that the default number of map tasks is 2, but I'm getting 38 map tasks. I also read that mapred.map.tasks only suggests a number and that the actual number of map tasks is determined by the number of input splits. However, 1 GB divided by 64 MB is about 17, so I still don't understand why 38 map tasks were created.
1) If I want to use only 1 map task, do I have to set the input splits size to 1GB??
2) Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??
The number of mappers is actually governed by the InputFormat you are using. That said, the InputFormat may vary based on the type of data you are processing. Normally, for data stored as files in HDFS, FileInputFormat or a subclass is used, which works on the principle of MR split = HDFS block. However, this is not always true. Say you are processing a flat binary file: in that case there is no delimiter (\n or anything else) to mark a split boundary. What would you do then? So the above principle doesn't always work.
Consider another scenario where you are processing data stored in a DB rather than in HDFS. What happens then, given that there is no concept of a 64 MB block size when we talk about DBs?
The framework tries its best to carry out the computation as efficiently as possible, which might involve creating fewer or more mappers than you specified or expected. So, to see how exactly mappers are getting created, you need to look at the InputFormat used by your job, the getSplits() method to be precise.
If I want to use only 1 map task, do I have to set the input splits size to 1GB??
You can override the isSplitable(FileSystem, Path) method of your InputFormat to ensure that the input files are not split-up and are processed as a whole by a single mapper.
Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??
It depends on availability. Mappers can run on multiple cores simultaneously. And a single core can run multiple mappers sequentially.
Some add-on to your question 2: the parallelism of map/reduce tasks running on a node is controllable. You can set the maximum number of map/reduce tasks that a tasktracker runs simultaneously via mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum. The default for both parameters is 2. For a 4-core node, mapreduce.tasktracker.map.tasks.maximum should be increased to at least 4, i.e. to make use of each core; 2 for max reduce tasks is probably fine. Btw, finding the best values for max map/reduce tasks is non-trivial, as it depends on the degree of job parallelism on the cluster, whether the mappers/reducers of the job(s) are IO- or computationally intensive, etc.