What I'm trying to do
I'm new to Hadoop and I'm trying to run MapReduce several times with different numbers of mappers and reducers, and compare the execution times. The file size is about 1GB, and I'm not specifying the split size, so it should default to 64MB. I'm using a machine with 4 cores.
What I've done
The mapper and reducer are written in Python, so I'm using Hadoop Streaming. I specified the number of map tasks and reduce tasks by passing '-D mapred.map.tasks=1 -D mapred.reduce.tasks=1'.
Problem
Because I specified 1 map task and 1 reduce task, I expected to see just one map attempt, but I actually got 38 map attempts and 1 reduce task. I read tutorials and SO questions similar to this problem, and some said that the default number of map tasks is 2, yet I'm getting 38. I also read that mapred.map.tasks is only a hint and that the actual number of map tasks equals the number of input splits. However, 1GB divided by 64MB gives roughly 16-17 splits, so I still don't understand why 38 map tasks were created.
1) If I want to use only 1 map task, do I have to set the input split size to 1GB?
2) Let's say I successfully specify that I want to use only 2 map tasks. Will that use 2 cores, with each core running 1 map task?
The number of mappers is actually governed by the InputFormat you are using. That said, the InputFormat varies with the type of data you are processing. Normally, for data stored as files in HDFS, FileInputFormat (or a subclass) is used, which works on the principle of one MR split per HDFS block. However, this is not always true. Say you are processing a flat binary file: there is no delimiter (\n or anything else) to mark a split boundary, so what would you do in that case? The above principle doesn't always work.
Consider another scenario where you are processing data stored in a DB rather than in HDFS. What happens then, given that there is no concept of a 64MB block size for databases?
The framework tries its best to carry out the computation as efficiently as possible, which might mean creating fewer or more mappers than you specified or expected. So, to see exactly how mappers get created, you need to look into the InputFormat you are using in your job, the getSplits() method to be precise.
If I want to use only 1 map task, do I have to set the input splits size to 1GB??
You can override the isSplitable(FileSystem, Path) method of your InputFormat to ensure that the input files are not split up and are processed as a whole by a single mapper.
Let's say I successfully specify that I want to use only 2 map tasks, does it use 2 cores? And each core has 1 map task??
It depends on availability. Mappers can run on multiple cores simultaneously. And a single core can run multiple mappers sequentially.
An add-on to your question 2: the parallelism of map/reduce tasks on a node is controllable. You can set the maximum number of map/reduce tasks a tasktracker runs simultaneously via mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum. The default for both parameters is 2. For a 4-core node, mapreduce.tasktracker.map.tasks.maximum should be increased to at least 4 to make use of each core; 2 for the reduce maximum is probably fine. Btw, finding the best values for these maximums is non-trivial, as it depends on the degree of job parallelism on the cluster, whether the mappers/reducers of the job(s) are IO- or computationally intensive, etc.
Related
I'm trying to force eager evaluation for PySpark, using the count methodology I read online:
spark_df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
spark_df.cache().count()
However, when I try running the code, the cache-and-count part takes forever to run. My data size is relatively small (2.7GB, 15 million rows), but after 28 minutes of running I decided to kill the job. For comparison, reading the data with the pandas.read_sql() method took only 6 min 43 seconds.
The machine I'm running the code on is pretty powerful (20 vCPUs, 160 GB RAM, Windows OS). I believe I'm missing a step that would speed up the count statement.
Any help or suggestions are appreciated.
When you read with pandas, it will use as much of the machine's available memory as it needs (potentially all 160GB you mentioned, which is far larger than the ~3GB of data itself).
However, it's not the same with Spark. When you start your Spark session, you typically have to state upfront how much memory per executor (and driver, and application master if applicable) you want to use; if you don't specify it, the default is 1GB according to the latest Spark documentation. So the first thing you want to do is give more memory to your executors and driver.
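For example, a minimal sketch of giving the session more memory when it is created (the 32g values are placeholders to tune to your machine; note that spark.driver.memory only takes effect if it is set before the driver JVM starts, e.g. via spark-submit or before the first session is created):

from pyspark.sql import SparkSession

# hypothetical settings for a single large machine; adjust to your workload
spark = (
    SparkSession.builder
    .appName("jdbc-read")
    .config("spark.driver.memory", "32g")
    .config("spark.executor.memory", "32g")
    .getOrCreate()
)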
Second, reading from JDBC with Spark is tricky: how slow it is depends on the number of executors (and tasks), those numbers depend on how many partitions the RDD read from the JDBC connection has, and the number of partitions depends on your table, your query, columns, conditions, etc. One way to change this behaviour, to get more partitions, more tasks, more executors, ... is via these configurations: numPartitions, partitionColumn, lowerBound, and upperBound.
numPartitions is the number of partitions (and hence an upper bound on how many tasks/executors read in parallel)
partitionColumn is an integer-type column that Spark uses to split the read into ranges
lowerBound is the minimum value of partitionColumn that you want to read
upperBound is the maximum value of partitionColumn that you want to read
You can read more here https://stackoverflow.com/a/41085557/3441510, but the basic idea is that you want to use a reasonable number of executors (defined by numPartitions), each processing an equally distributed chunk of the data (defined by partitionColumn, lowerBound and upperBound).
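As a hedged sketch, reusing the names from your snippet (the column "id", the bounds and the partition count are assumptions; pick an indexed integer column from your own table and its real min/max):

spark_df = spark.read.jdbc(
    url=jdbcUrl,
    table=pushdown_query,
    column="id",                      # the partitionColumn (assumed name)
    lowerBound=1,                     # assumed minimum value of that column
    upperBound=15000000,              # assumed maximum value (~15 million rows)
    numPartitions=20,                 # number of partitions / parallel read tasks
    properties=connectionProperties,
)
spark_df.cache().count()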
I have a question pertaining to the scheduling/execution order of tasks in dask.distributed for the case of strong data reduction of a large raw dataset.
We are using dask.distributed for a code which extracts information from movie frames. Its specific application is crystallography, but quite generally the steps are:
Read frames of a movie stored as a 3D array in an HDF5 file (or a few such files, which are concatenated) into a dask array. This is obviously quite I/O-heavy.
Group these frames into consecutive sub-stacks of typically 10 movie stills, the frames of which are aggregated (summed or averaged), resulting in a single 2D image.
Run several, computationally heavy, analysis functions on the 2D image (such as positions of certain features), returning a dictionary of results, which is negligibly small compared to the movie itself.
We implement this by using the dask.array API for steps 1 and 2 (the latter using map_blocks with a block/chunk size of one or a few of the aggregation sub-stacks), then converting the array chunks to dask.delayed objects (using to_delayed), which are passed to a function doing the actual data reduction (step 3). We take care to properly align the chunks of the HDF5 arrays, the dask computation and the aggregation ranges in step 2, so that the task graph of each final delayed object (the elements of tasks) is very clean. Here's the example code:
import h5py
import numpy as np
import dask.array as da
from dask import delayed
from dask.distributed import Client

def sum_sub_stacks(mov):
    # aggregation function: sum consecutive groups of 10 frames into one image
    sub_stk = []
    for k in range(mov.shape[0]//10):
        sub_stk.append(mov[k*10:k*10+10, ...].sum(axis=0, keepdims=True))
    return np.concatenate(sub_stk)

def get_info(mov):
    # reduction function: per aggregated frame, return a small dict of results
    results = []
    for frame in mov:
        results.append({
            'sum': frame.sum(),
            'variance': frame.var()
            # ...actually much more complex/expensive stuff
        })
    return results

# connect to dask.distributed scheduler
client = Client(address='127.0.0.1:8786')

# 1: get the movie
fh = h5py.File('movie_stack.h5', 'r')
movie = da.from_array(fh['/entry/data/raw_counts'], chunks=(100, -1, -1))

# 2: sum sub-stacks within movie
movie_aggregated = movie.map_blocks(sum_sub_stacks,
                                    chunks=(10,) + movie.chunks[1:],
                                    dtype=movie.dtype)

# 3: create and run reduction tasks
tasks = [delayed(get_info)(chk)
         for chk in movie_aggregated.to_delayed().ravel()]
info = client.compute(tasks, sync=True)
The ideal scheduling of operations would clearly be for each worker to perform the 1-2-3 sequence on a single chunk and then move on to the next, which would keep I/O load constant, CPUs maxed out and memory low.
What happens instead is that all workers first try to read as many chunks as possible from the files (step 1), which creates an I/O bottleneck and quickly exhausts worker memory, causing thrashing to the local drives. Often the workers eventually move on to steps 2/3, which quickly frees up memory and properly uses all CPUs, but in other cases workers get killed in an uncoordinated way or the entire computation stalls. There are also intermediate cases where surviving workers behave reasonably only for a while.
Is there any way to give hints to the scheduler to process the tasks in the preferred order as described above or are there other means to improve the scheduling behavior? Or is there something inherently stupid about this code/way of doing things?
First, there is nothing inherently stupid about what you are doing at all!
In general, Dask tries to reduce the number of temporaries it holds onto, and it balances this against parallelizability (the width of the graph and the number of workers). Scheduling is complex, and there is another optimization Dask applies which fuses tasks together. With lots of little chunks you may run into issues: https://docs.dask.org/en/latest/array-best-practices.html?highlight=chunk%20size#select-a-good-chunk-size
Dask does have a number of optimization configurations which I would recommend playing with after considering other chunk sizes. I would also encourage you to read through the following issue, as there is a healthy discussion around scheduling configurations there.
Lastly, you might consider additional memory configuration of your workers, as you may want to control more tightly how much memory each worker should use.
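For example, a minimal sketch of tightening worker memory behaviour (the thresholds and cluster sizes below are illustrative assumptions, not recommendations; since your code connects to an existing scheduler, these settings would need to be applied in the worker processes' configuration rather than in the client):

import dask
from dask.distributed import Client

# fractions of the per-worker memory limit at which distributed reacts;
# the exact numbers here are placeholders to illustrate the knobs
dask.config.set({
    "distributed.worker.memory.target": 0.6,      # start spilling to disk
    "distributed.worker.memory.spill": 0.7,       # spill more aggressively
    "distributed.worker.memory.pause": 0.8,       # pause new tasks on the worker
    "distributed.worker.memory.terminate": 0.95,  # restart the worker
})

# a local cluster with an explicit per-worker memory limit
client = Client(n_workers=4, threads_per_worker=2, memory_limit="16GB")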
I have this code which splits the dataframe into chunks of 10000 rows and writes each chunk to file.
I tried a z1d instance with 24 CPUs and 192GB RAM, but even that didn't speed things up much; for 1 million rows it took 9 minutes.
This is the code:
from pyspark.sql.functions import monotonically_increasing_id

total = df2.count()
offset = 10000
counter = int(total/offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
    lower = i * offset
    upper = lower + offset
    filter = f"idx > {lower} and idx < {upper}"
    ddf = idxDf.filter(filter)
    ddf2 = ddf.drop("idx")
    ddf2.write.option("header", "false").option("delimiter", " ").option("compression", "gzip").csv(outputpath)
Is there any way I can make it faster? Currently I am using a single master node only. I have 100 million rows and want to know how fast I can do this with EMR.
It looks like my normal Python code is able to do the same thing in about the same time.
A few problems with what you’re trying to do here:
Stop trying to write PySpark code as if it's normal Python code. It isn't. Read up on exactly how Spark works first and foremost. You'll have more success if you change the way you program when you use Spark, rather than trying to force Spark to do what you want in the way you want.
Avoid for loops with Spark wherever possible. A for loop only runs in native Python, so you're not utilising Spark while it executes, which means one CPU on one Spark node will run that code.
Python is, by default, single threaded. Adding more CPUs will do literally nothing to the performance of native Python code (i.e. your for loop) unless you rewrite your code for either (a) multi-threaded processing or (b) distributed processing (i.e. Spark).
You only have one master node (and I assume zero slave nodes). That's going to take aaaaaaggggggggeeeessss to process that much data on a single machine. The point of Spark is to distribute the workload onto many slave nodes. There are some really technical ways to determine the optimal number of slave nodes for your problem, but as a starting point try something like 50 or 100 slaves. That should help you see a decent performance uplift (each node able to process roughly 1GB-4GB of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember that running a 100GB file through some heavy lifting took a whole day on 16 nodes; upping the machine spec and the number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df = df.repartition(n_files)
df.write.option("header", "false").option("delimiter", " ").option("compression", "gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje-hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it’s a good idea to do df.repartition(x) on the resulting dataframe after you load it, where x = number of slave nodes.
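For example, a minimal sketch of that (the input path and the partition count are placeholders):

# hypothetical input; repartition to roughly the number of slave nodes
df = spark.read.csv("s3://my-bucket/one_big_file.csv", header=False)
df = df.repartition(100)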
I'm currently working on a larger Apache Beam pipeline with the Python API which reads data from BigQuery and in the end writes it back to another BigQuery table.
One of the transforms needs to use a binary program to transform the data, and for that it needs to load a 23GB file with binary lookup data. Starting and running the program involves a lot of overhead (it takes about 2 minutes to load/run each time) and RAM, so it wouldn't make sense to start it up for just a single record. Plus, the 23GB file would need to be copied locally from Cloud Storage every time.
The workflow for the binary would be:
Copy the 23GB file from Cloud Storage if it's not there already
Save the records to a file
Run the binary with call()
Read the output of the binary and return it
The number of records the program can process at a time is basically unlimited, so it would be nice to get a somewhat-distributed Beam transform where I could specify a number of records to be processed at once (say 100'000 at a time), but still have it distributed so it can run on 100'000 records at a time on multiple nodes.
I don't see Beam supporting this behaviour directly. It might be possible to hack something together as a KeyedCombineFn operation that collects records based on some split criterion/key and then runs the binary in the merge_accumulators step over the accumulated records, but this seems very hackish to me.
Or is it possible to GroupByKey and process groups as batches? Does this guarantee that each group is processed at once, or can groups be split behind the scenes by Beam?
I also saw there's a GroupIntoBatches in the Java API, which sounds like what I'd need, but isn't available in the Python SDK as far as I can tell.
My two questions are: what's the best way (performance-wise) to achieve this use case in Apache Beam, and, if there isn't a good solution, is there some other Google Cloud service that might be better suited and could be used like Beam --> Other Service --> Beam?
Groups cannot be split behind the scenes, so using a GroupByKey should work. In fact, this is a requirement since each individual element must be processed on a single machine and after a GroupByKey all values with a given key are part of the same element.
You will likely want to assign random keys. Keep in mind that if there are too many values with a given key it may also be difficult to pass all of those values to your program -- so you may also want to limit how many of the values you pass to the program at a time and/or adjust how you assign keys.
One trick for assigning random keys is to generate a random number in start_bundle (say between 1 and 1000) and then in process just increment it, wrapping back around to 1 after 1000. This avoids generating a random number for every element and still ensures a good distribution of keys.
You could wrap this logic in a PTransform (divide a PCollection<T> into PCollection<List<T>> chunks for processing), which would potentially be reusable in similar situations.
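A minimal Python sketch of that idea (the text source, the bucket count and run_binary_on_batch are placeholders; the real batch step would write the records to a file, invoke the binary with call() and parse its output):

import random
import apache_beam as beam

class AssignBucketKey(beam.DoFn):
    # assigns a pseudo-random bucket key, generated once per bundle
    def __init__(self, num_buckets=1000):
        self.num_buckets = num_buckets

    def start_bundle(self):
        # pick a random starting key once per bundle
        self.key = random.randint(1, self.num_buckets)

    def process(self, element):
        # increment and wrap around so keys stay in 1..num_buckets
        self.key = self.key % self.num_buckets + 1
        yield (self.key, element)

def run_binary_on_batch(keyed_batch):
    key, records = keyed_batch
    # placeholder: write `records` to a file, run the binary with call(),
    # read its output and return the parsed results
    return list(records)

with beam.Pipeline() as p:
    results = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')  # placeholder source
        | 'AssignKeys' >> beam.ParDo(AssignBucketKey(num_buckets=1000))
        | 'Group' >> beam.GroupByKey()
        | 'RunBinary' >> beam.FlatMap(run_binary_on_batch)
    )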
I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset on a 59th variable, which is calculated from the other 58 variables. The variable happens to be a floating-point number with four places after the decimal point.
There are two possible approaches:
The normal merge sort
While calculating the 59th variable, I start sending values in particular ranges to particular nodes, sort the ranges on those nodes, and then combine them in the reducer. Once I have perfectly sorted data, I also know where to merge which set of data; it basically becomes appending.
Which is the better approach, and why?
I'll assume that you are looking for a total sort order across all your rows, without a secondary sort. I should also mention that 'better' is rarely a clear-cut question, since there is typically a trade-off between time and space, and in Hadoop we tend to think in terms of space rather than time, unless you use products optimized for time (Teradata, for example, can put databases in memory for Hadoop use).
Out of the two possible approaches you mention, I think only one would work within the Hadoop infrastructure: number 2. Since Hadoop leverages many nodes to do one job, sorting becomes a little trickier to implement, and we typically want the 'shuffle and sort' phase of MR to take care of the sorting, since distributed sorting is at the heart of the programming model.
At the point when the 59th variable is generated, you would want to sample its distribution so that you can send it through the framework and then merge like you mentioned. Consider the case where one range of the variable contains 80% of your values: that could send 80% of your data to one reducer, which would then do most of the work. This assumes, of course, that some keys will be grouped in the sort and shuffle phase, which will be the case unless you made them unique. It's up to the programmer to set up partitioners that evenly distribute the load by sampling the key distribution.
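To make the sampling idea concrete, a small sketch (sampled_values is a placeholder for a random sample of the 59th variable; the cut points could then drive a custom partitioner so each reducer gets a similar share of records):

import numpy as np

# placeholder sample of the 59th variable; in practice, read a random
# subset of the input and compute the variable for it
sampled_values = np.random.random(100_000)

num_reducers = 32
# quantile cut points give each reducer roughly the same number of records
cut_points = np.quantile(sampled_values, np.linspace(0, 1, num_reducers + 1)[1:-1])

def partition(value):
    # index of the reducer a record with this key value should go to
    return int(np.searchsorted(cut_points, value))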
If, on the other hand, we were to sort in memory, we could accomplish the same thing during the reduce, but there are inherent scalability issues: the sort is only as good as the amount of memory available on the node currently running it, and performance drops off quickly once it starts spilling to disk for the data that did not fit into memory. And if you ignore the sampling issue, you will likely run out of memory unless all your key-value pairs are evenly distributed and you understand the memory capacity of your data.
Check out the Hadoop Comparator Class part of the Hadoop Streaming wiki page.
You can move the dataset to HDFS, use Python to write a mapper, and run a Hadoop Streaming job with just that mapper; as long as the job has at least one reduce task, Hadoop Streaming's shuffle/sort will automatically sort the records by key for you.
Then you can use hdfs dfs -getmerge and -copyToLocal to move the sorted records back to local if you want.
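A minimal sketch of such a streaming mapper (compute_key is a placeholder for however the 59th variable is derived, and the input is assumed to be tab-separated; note that the streaming sort is lexicographic by default, so either format the key with a fixed width as below or configure KeyFieldBasedComparator for numeric sorting):

#!/usr/bin/env python
# mapper.py -- emit "sort_key<TAB>original_record"; the streaming shuffle
# then sorts the records by that key before they reach the reducer
import sys

def compute_key(fields):
    # placeholder: derive the 59th variable from the 58 input fields
    return float(fields[0])

for line in sys.stdin:
    line = line.rstrip('\n')
    fields = line.split('\t')
    key = compute_key(fields)
    # zero-padded fixed-width formatting so lexicographic order matches
    # numeric order (assumes non-negative keys with 4 decimal places)
    sys.stdout.write('%015.4f\t%s\n' % (key, line))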