I have approximately 200 files, named part-0001, part-0002, and so on, in a single directory on a Linux machine. Each has approximately one million rows with the same columns (call them 'a', 'b', and so on). Let the pair ('a', 'b') be the key for each row (with many duplicates).
At the same time, I have set up a Spark 2.2.0 cluster with a master and two slaves with a total of 42 cores available. The address is spark://XXX.YYY.com:7077.
I then use PySpark to connect to the cluster and compute the counts across the 200 files for each unique pair as follows.
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
sc = SparkContext("spark://XXX.YYY.com:7077")
sqlContext = SQLContext(sc)
data_path = "/location/to/my/data/part-*"
sparkdf = sqlContext.read.csv(path=data_path, header=True)
dfgrouped = sparkdf.groupBy(['a','b'])
counts_by_group = dfgrouped.count()
This works in that I can see Spark progressing through a series of messages and it does indeed return results that look plausible.
Problem: While this calculation is being performed, top shows no evidence that the slave cores are doing anything. There doesn't appear to be any parallelization: each slave has a single related Java process that was there before the job (plus processes from other users and background system processes), so it appears that the master is doing all the work. Given that there are 200-odd files, I had expected to see 21 processes running on each slave machine until things wound down (this is what I see when I explicitly invoke parallelize, e.g. count = sc.parallelize(c=range(1, niters + 1), numSlices=ncores).map(f).reduce(add), in a separate implementation).
Questions: How do I ensure that Spark is actually parallelizing the count? I would like each core to grab one or more files, perform the count for the pairs it sees in those files, and then have the individual results reduced into a single DataFrame. Shouldn't I see this in top? Do I need to invoke parallelization explicitly?
(FWIW, I have seen examples using partitioning, but my understanding is that this is used to distribute the processing of chunks of a single file. My case is that I have many files.)
Thanks in advance.
TL;DR There is probably nothing wrong with your deployment.
I had expected to see 21 processes running
Unless you specifically configured Spark to use a single core per executor JVM, there is no reason for this to happen. Unlike the RDD example you've mentioned in the question, the DataFrame API doesn't use Python workers at all, with the exception of Python UserDefinedFunctions.
At the same time, JVM executors use threading instead of full-fledged system processes (PySpark uses the latter to avoid the GIL). Furthermore, the default spark.executor.cores in standalone mode is equal to the number of available cores on the worker. So without additional configuration you should see two executor JVMs, each using 21 data-processing threads.
Overall, you should check the Spark UI; if you see tasks assigned to the executors, all should be fine.
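If you do want to pin the resources explicitly (for example, to cap the cores per executor or limit the total for the application), you can set the relevant properties on the SparkConf. A minimal sketch, reusing the master URL and path from the question; the specific values are illustrative assumptions, not recommendations:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# illustrative resource settings for standalone mode (values are assumptions)
conf = (SparkConf()
        .setMaster("spark://XXX.YYY.com:7077")
        .setAppName("pair-counts")
        .set("spark.executor.cores", "21")   # threads per executor JVM
        .set("spark.cores.max", "42"))       # total cores for this application

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

sparkdf = sqlContext.read.csv(path="/location/to/my/data/part-*", header=True)
counts_by_group = sparkdf.groupBy(["a", "b"]).count()
counts_by_group.show()  # watch the stages and task distribution in the Spark UI while this runs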
I have a rather large dataset across different files that I read in using dask, followed by a machine learning task for which I want to use dask as parallel backend.
I've noticed that reading in the files runs much faster using a Client with a higher number of workers instead of one worker with many threads. However, their individual share of memory is then too small to handle the ML task. I would therefore like to change the number of my workers to 1, with the maximum possible number of threads assigned to that new unique worker. Is there a way to do that without completely killing and restarting my client?
I looked into the docs but couldn't find anything of use. I'd also be happy about a hint on where to look for this kind of info next time, if not there.
This is an example of what my current code looks like:
from dask.distributed import Client
import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
client = Client(n_workers=4, threads_per_worker=2)
df = dd.read_hdf(path_to_file_dir, '/data')
feats = df['feats'].compute()
labels = df['labels'].compute()
dummy = LogisticRegression()
with parallel_backend('dask'):
    dummy.fit(feats, labels)  # FAILS bc of too high memory consumption
You can manually create Worker/Nanny classes if you want, or use the SpecCluster class for more fine-grained control. These are typically used by developers, though, and may not be as user-friendly.
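As a rough illustration of the SpecCluster route (the class and option names follow the dask.distributed documentation, but treat this as a sketch rather than a drop-in replacement for your Client(n_workers=4, threads_per_worker=2) call; the worker name and thread count are assumptions):
from dask.distributed import Client, Scheduler, Nanny, SpecCluster

# one "fat" worker with many threads for the memory-hungry fit step
scheduler = {"cls": Scheduler, "options": {"dashboard_address": ":8787"}}
workers = {
    "ml-worker": {"cls": Nanny, "options": {"nthreads": 8}},
}

cluster = SpecCluster(scheduler=scheduler, workers=workers)
client = Client(cluster)
In practice, if you don't need the cluster to stay up, closing the existing client and creating a new one with n_workers=1 and a higher threads_per_worker is often the simpler route.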
I have around 100GB of user data and want to process it using Apache Spark on my laptop. I have installed Hadoop and Spark, and as a test I uploaded a file of around 9 GB to HDFS and accessed and queried it using PySpark.
The test file has 113,959,238 records/rows in total. When I queried the data for a particular user, i.e.
select * from table where userid=????
it took around 6 minutes to retrieve the records of that user, and if I run this on the entire file it will take a lot of time.
The analysis that I want to make on that data is to extract the records of a user, run some operations on them, then process the data of the second user, and so on for all the users in the file. The data of the queried user will not be much, so it can be loaded into memory and the operations can be performed faster. But querying the records of a user from that big file takes time and slows down the process.
It is said that Spark is lightning fast, so surely I must be missing something, which is why it is taking that time. One thing that I noted while performing queries was that Spark was not utilizing its full RAM, but almost 100% of the CPU.
My machine specs are:
I also queried the data directly from the text file using Spark instead of the HDFS file, but there wasn't much difference in time.
The Python code that I wrote is
from pyspark import SparkConf
from pyspark.sql import SparkSession, SQLContext
import time

# pass the memory settings to the session builder so they actually take effect
conf = SparkConf()
conf.set("spark.executor.memory", "8g")
conf.set("spark.driver.memory", "8g")

sparkSession = SparkSession.builder \
    .appName("example-pyspark-read-and-write") \
    .config(conf=conf) \
    .getOrCreate()
sc = sparkSession.sparkContext
sqlContext = SQLContext(sc)

#df_load = sparkSession.read.format("csv").option("header","true").load("hdfs://0.0.0.0:19000/test.txt")
df_load = sparkSession.read.format("csv").option("header","true").load("C:/Data/test_file/test.txt")
df_load.registerTempTable('test')

sp_tstart = time.time()
df = sqlContext.sql("select * from test where user_id='12345'")
db = df.rdd.collect()   # pulls every matching row to the driver
sp_tend = time.time()
t_time = sp_tend - sp_tstart

df.show()
print(t_time / 60)      # elapsed minutes
Given my machine specs, is Spark taking a normal amount of time, or do I need to configure something? Do I need to upgrade the specs, or are they enough for this data?
One of the things to understand with Spark, Hadoop and other Big Data providers is that they aren't aiming to get the maximum possible throughput from an individual machine. They're aiming to let you split the processing efficiently across multiple machines. They sacrifice a certain amount of individual machine throughput to provide horizontal scalability.
While you can run Spark on just a single machine, the main reasons to do so are to learn Spark or to write test code to then deploy to run against a cluster with more data.
As others have noted, if you just want to process data on a single machine, then there are libraries which are going to be more efficient in that scenario. 100GB is not a huge amount to process on a single machine.
From the sound of things you'd be better off importing that data into a database and adding suitable indexing. One other thing to understand is that a lot of the benefit of Big Data systems is in supporting analysis and processing of most or all of the data. Traditional database systems like Postgres or SQL Server can handle terabytes of data well when you're mainly querying for small amounts of data using indices.
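As a sketch of the database route using only Python's built-in sqlite3 (the table layout, column names, and file names here are invented for illustration; a server database like Postgres would follow the same pattern):
import sqlite3

conn = sqlite3.connect("users.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (user_id TEXT, payload TEXT)")

# bulk-load rows; in practice you would stream these in from the CSV file(s)
rows = [("12345", "..."), ("67890", "...")]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)

# the index is what turns a full scan into a near-instant lookup by user
conn.execute("CREATE INDEX IF NOT EXISTS idx_users_user_id ON users (user_id)")
conn.commit()

for row in conn.execute("SELECT * FROM users WHERE user_id = ?", ("12345",)):
    print(row)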
If your objective is to analyze 100GB of data using Python and there is no requirement for Spark, you can also take a look at Dask: https://dask.org/. It should be easier to set up and use with Python.
For example, a Dask dataframe: https://docs.dask.org/en/latest/dataframe.html
>>> import dask.dataframe as dd
>>> df = dd.read_csv('2014-*.csv')
>>> df.head()
x y
0 1 a
1 2 b
2 3 c
3 4 a
4 5 b
5 6 c
>>> df2 = df[df.y == 'a'].x + 1
You don't need Hadoop to process the file locally.
The advantages of Hadoop only apply when you use more than one machine as your file will be chunked and distributed to many processes at once.
Similarly, 100GB of plaintext isn't really "big data"; it still fits on a single machine, and if stored in a better format like ORC or Parquet it would be significantly smaller.
Also, to get faster times, don't use collect()
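For example, a one-off conversion to Parquet followed by a filtered write, rather than a collect(), could look like this (a sketch reusing the sparkSession and path from the question; the output paths are assumptions):
# convert the CSV once to Parquet, then query the columnar copy
df_csv = sparkSession.read.format("csv").option("header", "true") \
    .load("C:/Data/test_file/test.txt")
df_csv.write.mode("overwrite").parquet("C:/Data/test_file/test_parquet")

df = sparkSession.read.parquet("C:/Data/test_file/test_parquet")
one_user = df.where(df.user_id == "12345")

# write the result out instead of pulling every row to the driver with collect()
one_user.write.mode("overwrite").parquet("C:/Data/test_file/user_12345")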
If you simply want to look up data by ID, use a key-value database like Redis or Accumulo, not Hadoop/Spark.
The type of job you have described is a highly CPU-intensive process, which unfortunately is only going to be sped up significantly by running many parallel queries on partitions of the data set. Compound the problem with not having enough system memory to hold the entire dataset, and now you are also limited by significant reads/writes on the hard drive.
This is the type of task where Spark really shines. The reason you aren't experiencing any improvement in performance is that with a single system you're missing the benefit of Spark entirely, which is the ability to split the data set into many partitions and distribute it across many machines that can work on many different user IDs at the same time.
Each worker node in your cluster will have a smaller data set to look at, which means that on each node the entire data set it is looking at can easily be stored in memory. Each find-and-replace function (one per user ID) can be sent to a single CPU core, which means that if you have 5 workers with 16 cores each, you can process 80 IDs at a time, from memory, on an optimized partition size.
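A sketch of that idea in PySpark (the input/output paths, column name, and the choice of 80 partitions are assumptions matching the 5-workers-times-16-cores example above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-user-processing").getOrCreate()

# spread the data by user_id so each task works on a subset of users that fits in memory
df = spark.read.parquet("s3://your-bucket/users_parquet")
by_user = df.repartition(80, "user_id")

# per-user work expressed as a grouped aggregation runs one task per partition
per_user_stats = by_user.groupBy("user_id").count()
per_user_stats.write.mode("overwrite").parquet("s3://your-bucket/per_user_stats")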
Google Cloud Dataproc and Azure Databricks are super platforms for doing this. Simply choose the number of workers you need and the CPU/memory of each node, and fire up the cluster. Connect to your data and kick off your PySpark code. It can process this data so quickly that even though you are paying by the minute for the cluster, it will end up being very cheap ($10-$20 perhaps).
I'm trying to parallelize a machine learning prediction task via Spark. I've used Spark successfully a number of times before on other tasks and have faced no issues with parallelization before.
In this particular task, my cluster has 4 workers. I'm calling mapPartitions on an RDD with 4 partitions. The map function loads a model from disk (a bootstrap script distributes all that is needed to do this; I've verified it exists on each slave machine) and performs prediction on data points in the RDD partition.
The code runs, but only utilizes one executor. The logs for the other executors say "Shutdown hook called". On different runs of the code, it uses different machines, but only one at a time.
How can I get Spark to use multiple machines at once?
I'm using PySpark on Amazon EMR via Zeppelin notebook. Code snippets are below.
%spark.pyspark
sc.addPyFile("/home/hadoop/MyClassifier.py")
sc.addPyFile("/home/hadoop/ModelLoader.py")
from ModelLoader import ModelLoader
from MyClassifier import MyClassifier
def load_models():
    models_path = '/home/hadoop/models'
    model_loader = ModelLoader(models_path)
    models = model_loader.load_models()
    return models

def process_file(file_contents, models):
    filename = file_contents[0]
    filetext = file_contents[1]
    pred = MyClassifier.predict(filetext, models)
    return (filename, pred)

def process_partition(file_list):
    models = load_models()
    for file_contents in file_list:
        pred = process_file(file_contents, models)
        yield pred
all_contents = sc.wholeTextFiles("s3://some-path", 4)
processed_pages = all_contents.mapPartitions(process_partition)
processedDF = processed_pages.toDF(["filename", "pred"])
processedDF.write.json("s3://some-other-path", mode='overwrite')
There are four tasks as expected, but they all run on the same executor!
I have the cluster running and can provide logs as available in Resource Manager. I just don't know yet where to look.
Two points to mention here (not sure if they will solve your issue though):
wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat; because of CombineFileInputFormat, it will try to combine groups of small files into one partition. So if you set the number of partitions to 2, for example, you 'might' get two partitions, but it is not guaranteed; it depends on the size of the files you are reading.
The output of wholeTextFiles is an RDD that contains an entire file in each record (and each record/file cannot be split, so it will end up in a single partition/worker). So if you are reading only one file, you will end up having the full file in one partition, even though you set the partitioning to 4 in your example.
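If you want to force the records to spread out regardless of how wholeTextFiles grouped the input, one option is an explicit repartition on the resulting RDD; a sketch based on the code in the question (the path is the same placeholder):
# shuffle the per-file records across 4 partitions instead of relying on
# CombineFileInputFormat's grouping; this adds a shuffle, but lets all
# executors take part in mapPartitions
all_contents = sc.wholeTextFiles("s3://some-path", 4).repartition(4)
processed_pages = all_contents.mapPartitions(process_partition)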
The process has as many partitions as you specified, but it is running in a serialized way.
Executors
The process might spin up the default number of executors. This can be seen in the YARN resource manager. In your case all the processing is done by one executor. If the executor has more than one core, it will parallelize the job. In EMR you have to make the following changes in order to give the executor more than one core.
What is specifically happening in your case is that the data is small, so all the data is read by one executor (i.e. using one node). Without the following property the executor uses only a single core; hence all the tasks are serialized.
Setting the property
sudo vi /etc/hadoop/conf/capacity-scheduler.xml
Set the following property as shown:
"yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
In order to make this property take effect you have to restart YARN. Stop the resource manager:
sudo hadoop-yarn-resourcemanager stop
Then start it again:
sudo hadoop-yarn-resourcemanager start
When your job is submitted, check YARN and the Spark UI.
In YARN you will see more cores per executor.
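As a complement to the resource-calculator change, you can also request executor resources explicitly when the application starts. A sketch with illustrative numbers (on EMR/Zeppelin the SparkSession is usually created by the interpreter, so these settings often need to go into the interpreter or spark-submit configuration instead):
from pyspark.sql import SparkSession

# illustrative values; equivalent to spark-submit --num-executors 4 --executor-cores 4
spark = (SparkSession.builder
         .appName("prediction")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())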
Below is working code to connect to a SQL Server and save one table to a CSV file.
conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://DBServer:PORT").option("databaseName","xxx").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","xxx").option("user","xxx").option("password","xxxx").load()
df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("poc/amitesh/csv")
exit()
I have a scenario where I have to save 4 tables from the same database in CSV format to 4 different files at the same time through PySpark code. Is there any way we can achieve this objective? Or are these splits done at the HDFS block size level, so that if you have a file of 300MB and the HDFS block size is set to 128MB, you get 3 blocks of 128MB, 128MB and 44MB respectively?
where in I have to save 4 table from same database in CSV format in 4 different files at a time through pyspark code.
You have to code a transformation (reading and writing) for every table in the database (using sqlContext.read.format).
The only difference between the table-specific ETL pipelines is a different dbtable option per table. Once you have a DataFrame, save it to its own CSV file.
The code could look as follows (in Scala so I leave converting it to Python as a home exercise):
val datasetFromTABLE_ONE: DataFrame = sqlContext.
read.
format("jdbc").
option("url","jdbc:sqlserver://DBServer:PORT").
option("databaseName","xxx").
option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").
option("dbtable","TABLE_ONE").
option("user","xxx").
option("password","xxxx").
load()
// save the dataset from TABLE_ONE into its own CSV file
datasetFromTABLE_ONE.write.csv("table_one.csv")
Repeat the same code for every table you want to save to CSV.
Done!
100-table Case — Fair Scheduling
The solution begs another question:
What when I have 100 or more tables? How to optimize the code for that? How to do it effectively in Spark? Any parallelization?
The SparkContext that sits behind the SparkSession we use for the ETL pipeline is thread-safe, which means you can use it from multiple threads. If you think about a thread per table, that's the right approach.
You could spawn as many threads as you have tables, say 100, and start them. Spark could then decide what and when to execute.
That's something Spark does using Fair Scheduler Pools. That's a not very widely known feature of Spark that's worth considering for this case:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
Use it and your loading and saving pipelines may get faster.
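A sketch of the thread-per-table idea in Python (the pool name, table list, and JDBC options are placeholders in the style of the question; it assumes spark.scheduler.mode is set to FAIR, and how thread-local properties propagate from Python threads can depend on the Spark version):
from threading import Thread
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("many-tables-to-csv")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

def export_table(table):
    # tag jobs from this thread so they land in a fair-scheduler pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")
    (spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://DBServer:PORT")
        .option("databaseName", "xxx")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("dbtable", table)
        .option("user", "xxx")
        .option("password", "xxxx")
        .load()
        .write.csv(table.lower() + ".csv"))

threads = [Thread(target=export_table, args=(t,)) for t in ["TABLE_ONE", "TABLE_TWO"]]
for t in threads:
    t.start()
for t in threads:
    t.join()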
I have been testing how to use dask (a cluster with 20 cores), and I am surprised by the speed that I get when calling a len function versus slicing through loc.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client('192.168.1.220:8786')
log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log,npartitions=20)
# This is the code that runs slowly
# (2.9 seconds, whilst I would expect no more than a few hundred milliseconds)
print(len(logd))
#Instead this code is actually running almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour there is something I am not grasping about the library.
All of the green boxes correspond to "from_pandas", whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called, which is significantly faster, and the calls don't seem to be blocked waiting for one worker to finish its task before starting the other).
Good question, this gets at a few points about when data is moving up to the cluster and back down to the client (your Python session). Let's look at a few stages of your computation.
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute()). So now Dask kicks in. First it moves your data out to the cluster (slow), then it calls len on all of the 20 partitions (fast), it aggregates those (fast), and then moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds
with dask.set_options(get=dask.threaded.get):  # no cluster, just local threads
    print(len(logd))  # stays on client
But presumably you want to know how to scale out to larger datasets, so let's do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blogpost here are the assumptions that we're making about where the file lives.
Generally when you provide a filename to dd.read_csv, it assumes that the file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system, then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log,npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
This gets into a more complex API though, so I'll just point you to the docs:
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html