How to read files in parallel in Databricks? - python

Could someone tell me how to read files in parallel? I'm trying something like this:
def processFile(path):
    df = spark.read.json(path)
    return df.count()

paths = ["...", "..."]
distPaths = sc.parallelize(paths)
counts = distPaths.map(processFile).collect()
print(counts)
It fails with the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is there any other way to optimize this?

In your particular case, you can just pass the whole paths array to DataFrameReader:
df = spark.read.json(paths)
...and Spark will parallelize reading the individual files.
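If you still need one count per file (as the original processFile computed), a sketch of one way to get that from the single combined read is to tag each row with its source file via input_file_name() and group on it (the column name source_file is just an example, not from the question):

from pyspark.sql import functions as F

df = spark.read.json(paths)

# one row per input file with its record count
counts_per_file = (df
    .withColumn("source_file", F.input_file_name())
    .groupBy("source_file")
    .count())
counts_per_file.show(truncate=False)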

Related

Type Error thrown during Multiprocessing while trying to append results to a shared list

I am trying to parallelize a function that appends results to a shared list using the multiprocessing library and keep getting the following error:
TypeError: '<=' not supported between instances of 'ListProxy' and 'int'
I think the part that is tripping me up is appending each result to a shared list. If I take that part of the code out, the script runs fine. However, the whole point of running this script is to append a lot of things to one list.
A Google search tells me that this error is caused because I am "comparing a sequence to an integer". I get what the error is saying, but I don't know how to interpret it in terms of the multiprocessing library.
I am not sure what I am doing wrong here. Could someone please point me in the right direction?
Below is a simplified version of the code I am trying to use:
import lasio
import glob
from multiprocessing import Pool, Process, Manager

# generates a list of filepaths to be fed to the parallelization
wellfiles = []
for file in glob.glob(....filedirectory...//*):
    wellfiles.append(file)

def process_file(filepath, L):
    # read the file
    las = lasio.read(filepath)
    # extract the value I want from the file
    uwi = las.well[11].value
    # print the value
    print(uwi)
    # append the value to a shared list. THIS IS WHERE I THINK THINGS ARE FAILING
    L.append(uwi)

if __name__ == '__main__':
    with Manager() as manager:
        L = manager.list()
        p = Pool(5)
        p.map(process_file, wellfiles, L)
        p.join()
Please let me know if I can clarify anything
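For what it's worth, here is a minimal sketch of one way this could be wired up (my own reading of the problem, not from an accepted answer): the third positional argument of Pool.map is chunksize, so passing L there is what ends up comparing the ListProxy to an integer. Binding the shared list with functools.partial keeps map() iterating over the file paths only, and the pool is closed before joining. The directory path below is a placeholder.

import glob
from functools import partial
from multiprocessing import Pool, Manager

import lasio

def process_file(L, filepath):
    # read the file and extract the value I want
    las = lasio.read(filepath)
    uwi = las.well[11].value
    # append the value to the shared list
    L.append(uwi)

if __name__ == '__main__':
    wellfiles = glob.glob("path/to/well/files/*")  # placeholder directory
    with Manager() as manager:
        L = manager.list()
        with Pool(5) as p:
            # partial binds L, so map() only iterates over the file paths
            p.map(partial(process_file, L), wellfiles)
        print(list(L))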

How to: Pyspark dataframe persist usage and reading-back

I'm quite new to pyspark, and I'm getting the following error:
Py4JJavaError: An error occurred while calling o517.showString.
I've read that it is due to a lack of memory:
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
So I've read that a workaround for this situation is to use df.persist() and then read the persisted df back, so I would like to know:
Given a for loop in which I do some .join operations, should I use the .persist() inside the loop or at the end of it? e.g.
for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer').persist()
--> or <--
for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')
df_AA.persist()
Once I've done that, how should I read back?
df_AA.unpersist()? sqlContext.read.some_thing(df_AA)?
I'm really new to this, so please, try to explain as best as you can.
I'm running on a local machine (8GB ram), using jupyter-notebooks(anaconda); windows 7; java 8; python 3.7.1; pyspark v2.4.3
Spark is a lazily evaluated framework, so none of the transformations (e.g. join) are executed until you call an action.
So go ahead with what you have done:
from pyspark import StorageLevel

for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')
df_AA.persist(StorageLevel.MEMORY_AND_DISK)
df_AA.show()
There are multiple persist options available, and choosing MEMORY_AND_DISK will spill the data that cannot be held in memory to disk.
Also, GC errors could be a result of too little driver memory provided for the Spark application to run.
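As a rough sketch of that second point, in a notebook the driver memory can be raised when the session is created (the 4g value here is an example; spark.driver.memory only takes effect if it is set before the JVM is launched, i.e. before any SparkContext exists):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("persist-example") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()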

How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session

I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect on the resulting error after restarting jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np

try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker') # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")

<... (other code) ...>

df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to get that value actually set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number
of values for the pivot column if none are specified. This is mainly
to catch mistakes and avoid OOM situations. The config key is
spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to change the config variable upwards, since I have written the crosstab code already which works great on smaller datasets. If it turns out there truly is no way to change this config variable then my backup plans are, in order:
relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file is distributed together with jupyter:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS - the list of arguments that will be used with the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the mentioned file.
PS
There are also cases where people are trying to override this variable programmatically. You can give it a try too...
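For example, something along these lines may work, since spark.sql.pivotMaxValues is a runtime SQL conf in recent Spark versions (untested here, so treat it as a sketch). One thing worth checking in the question's code: if a SparkContext already exists, the try/except silently skips the new conf, so the increased value is never applied.

from pyspark.sql import SparkSession

# set it when building the session...
spark = SparkSession.builder \
    .master("local") \
    .appName("autoencoder_recommender_wide_user_record_maker") \
    .config("spark.sql.pivotMaxValues", "99999") \
    .getOrCreate()

# ...or change it on an already running session
spark.conf.set("spark.sql.pivotMaxValues", 99999)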

multi-processing with spark(PySpark) [duplicate]

This question already has an answer here:
How to run independent transformations in parallel using PySpark?
(1 answer)
Closed 5 years ago.
The usecase is the following:
I have a large dataframe, with a 'user_id' column in it (every user_id can appear in many rows). I have a list of users my_users which I need to analyse.
Groupby, filter and aggregate could be a good idea, but the available aggregation functions included in pyspark did not fit my needs. In the pyspark version I'm using, user-defined aggregation functions are still not fully supported, so I decided to leave that for now.
Instead, I simply iterate over the my_users list, filter each user in the dataframe, and analyse. In order to optimize this procedure, I decided to use a python multiprocessing pool, with one task for each user in my_users.
The function that does the analysis (and is passed to the pool) takes two arguments: the user_id and a path to the main dataframe, on which I perform all the computations (PARQUET format). In the method, I load the dataframe and work on it (a DataFrame can't be passed as an argument itself).
I get all sorts of weird errors, on some of the processes (different in each run), that look like:
PythonUtils does not exist in the JVM (when reading the 'parquet' dataframe)
KeyError: 'c' not found (also, when reading the 'parquet' dataframe. What is 'c' anyway??)
When I run it without any multiprocessing, everything runs smoothly, but slowly..
Any ideas where these errors are coming from?
I'll put some code sample just to make things clearer:
PYSPARK_SUBMIT_ARGS = '--driver-memory 4g --conf spark.driver.maxResultSize=3g --master local[*] pyspark-shell'  # if it's relevant
# ....

def users_worker(df_path, user_id):
    df = spark.read.parquet(df_path)  # The problem is here!
    ## the analysis of user_id in df is here

def user_worker_wrapper(args):
    users_worker(*args)

def analyse():
    # ...
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    users_pool = Pool(processes=len(my_users))
    users_pool.map(user_worker_wrapper, users_worker_args)
    users_pool.close()
    users_pool.join()
Indeed, as @user6910411 commented, when I changed the Pool to a ThreadPool (from multiprocessing.pool), everything worked as expected and these errors were gone.
The root causes of the errors themselves are also clear now; if you want me to share them, please comment below.
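For reference, a sketch of that swap, assuming spark, df_path and my_users are defined as in the question (the thread count here is an arbitrary choice, not from the original post):

from multiprocessing.pool import ThreadPool

def users_worker(df_path, user_id):
    df = spark.read.parquet(df_path)  # threads share the driver's SparkSession
    # ... the analysis of user_id in df is here

users_worker_args = [(df_path, user_id) for user_id in my_users]
users_pool = ThreadPool(processes=8)  # arbitrary; no need for one thread per user
users_pool.starmap(users_worker, users_worker_args)
users_pool.close()
users_pool.join()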

How to run a function on all Spark workers before processing data in PySpark?

I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this shitty work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not a question of how to distribute the files or how to load them; it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
I use static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant to other workers). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will be set. This is something that should really not be done for each batch; it's a one-time thing. Same with downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines, the simplest approach is to use the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles
with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Assuming fetch_models returns the models rather than saving them locally, you would do something like:
bc_models = sc.broadcast(fetch_models())
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that, according to this analysis, can significantly outperform distribution through HDFS.
