Multi-processing with Spark (PySpark) [duplicate] - python

This question already has an answer here: How to run independent transformations in parallel using PySpark? (1 answer). Closed 5 years ago.
The use case is the following:
I have a large dataframe with a 'user_id' column in it (every user_id can appear in many rows). I have a list of users, my_users, which I need to analyse.
Groupby, filter and aggregate could be a good idea, but the aggregation functions available in pyspark did not fit my needs. In this pyspark version, user-defined aggregation functions are still not fully supported, so I decided to leave that approach for now.
Instead, I simply iterate over the my_users list, filter the dataframe for each user, and analyse. To optimize this procedure, I decided to use a Python multiprocessing pool, with one task per user in my_users.
The function that does the analysis (and is passed to the pool) takes two arguments: the user_id and a path to the main dataframe, on which I perform all the computations (PARQUET format). In the method, I load the dataframe and work on it (a DataFrame can't be passed as an argument itself).
I get all sorts of weird errors on some of the processes (different in each run) that look like:
PythonUtils does not exist in the JVM (when reading the 'parquet' dataframe)
KeyError: 'c' not found (also when reading the 'parquet' dataframe. What is 'c' anyway??)
When I run it without any multiprocessing, everything runs smoothly, but slowly..
Any ideas where these errors are coming from?
I'll put some code sample just to make things clearer:
from multiprocessing import Pool

PYSPARK_SUBMIT_ARGS = '--driver-memory 4g --conf spark.driver.maxResultSize=3g --master local[*] pyspark-shell'  # if it's relevant

# ....

def users_worker(df_path, user_id):
    df = spark.read.parquet(df_path)  # The problem is here!
    ## the analysis of user_id in df is here

def users_worker_wrapper(args):
    users_worker(*args)

def analyse():
    # ...
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    users_pool = Pool(processes=len(my_users))
    users_pool.map(users_worker_wrapper, users_worker_args)
    users_pool.close()
    users_pool.join()

Indeed, as @user6910411 commented, when I changed the Pool to a ThreadPool (multiprocessing.pool.ThreadPool), everything worked as expected and these errors were gone.
The root causes of the errors themselves are also clear now; if you want me to share them, please comment below.
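For reference, a minimal sketch of the ThreadPool variant described above, assuming the same users_worker, df_path and my_users as in the question (starmap replaces the argument-unpacking wrapper here):

from multiprocessing.pool import ThreadPool

def analyse():
    users_worker_args = [(df_path, user_id) for user_id in my_users]
    # Threads run inside the driver process and share its Py4J gateway to the
    # JVM, so spark.read.parquet works; forked worker processes do not inherit
    # a usable SparkContext, which is what broke the process-Pool version.
    with ThreadPool(processes=len(my_users)) as users_pool:
        users_pool.starmap(users_worker, users_worker_args)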

Related

Select different dataset when testing | Separate test from production

This question is partly about how to test external dependencies (a.k.a. integration tests) and partly about how to implement that in Python for SQL with BigQuery specifically. Answers that only address 'this is how you should do integration tests' are also very welcome.
In my project I have two different datasets
'project_1.production.table_1'
'project_1.development.table_1'
When running my tests I would like to query the development environment. But how do I separate it properly from my production code? I don't want to clutter my production code with test (set-up) code.
Production code looks like:
def find_data(self, variable_x: str) -> DataFrame:
    query = '''
        SELECT *
        FROM `project_1.production.table_1`
        WHERE foo = @variable_x
    '''
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                name='variable_x', type_="STRING", value=variable_x
            )
        ]
    )
    df = self.client.query(
        query=query, job_config=job_config).to_dataframe()
    return df
Solution 1 : Environment variables for the dataset
The python-dotenv module can be used to differentiate production from development, as I already do for some parts of my code. The problem is that BigQuery does not allow parameterizing the dataset (to prevent SQL injection, I think). See the running parameterized queries docs.
From the docs
Parameters cannot be used as substitutes for identifiers, column
names, table names, or other parts of the query.
So passing the dataset name through an environment variable as a query parameter is not possible.
Solution 2 : Environment variable for flow control
I could add an if production == True check and select the dataset accordingly. However, this puts test/debug code in my production code, which I would like to avoid as much as possible.
from os import getenv

def find_data(variable_x: str) -> DataFrame:
    load_dotenv()
    PRODUCTION = getenv("PRODUCTION")
    if PRODUCTION == "True":
        *Execute query on project_1.production.table_1*
    else:
        *Execute query on project_1.development.table_1*
    job_config = (*snip*)
    df = (*snip*)
    return df
Solution 3 : Mimic function in testcode
Make a copy of the production code and set up the test code so that the development dataset is called.
This leads to duplication of code (one copy in the production code and one in the test code), and the duplicates will drift apart if the implementation of the function changes over time. So I think this solution is not 'Embracing Change'.
Solution 4 : Skip testing this function
Perhaps this function does not need to be called at all in my test code. I could just take a snippet of the result of this query and use it as 'data injection' into the tests that depend on it. However, then I need to adjust my architecture a bit.
The above solutions don't satisfy me completely. I wonder if there is another way to solve this issue, or if one of the above solutions is acceptable?
It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want. You could replace the first part of your function with the following code:
query = '''
    SELECT *
    FROM `{table}`
    WHERE foo = @variable_x
'''.format(table=getenv("DATA_TABLE"))
This works because the query is just a string, and you can do whatever you want with it before you pass it on to the BigQuery library. str.format lets us replace values inside a string, which is exactly what we need (see this article for a more in-depth explanation of str.format).
Important security note: it is in general a bad security practice to manipulate SQL queries as plain strings (as we are doing here), but since you control the environment variables of the application it should be safe in this particular case.
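Putting the answer together with the original function, a minimal sketch of find_data driven by an environment variable might look like the following. DATA_TABLE is an assumed variable name (set to project_1.production.table_1 or project_1.development.table_1 per environment), and the BigQuery client is passed in explicitly just to keep the sketch self-contained:

from os import getenv

from google.cloud import bigquery
from pandas import DataFrame


def find_data(client: bigquery.Client, variable_x: str) -> DataFrame:
    # The table identifier comes from the environment, not from a query
    # parameter, because BigQuery parameters cannot replace identifiers.
    query = '''
        SELECT *
        FROM `{table}`
        WHERE foo = @variable_x
    '''.format(table=getenv("DATA_TABLE", "project_1.development.table_1"))
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                name="variable_x", type_="STRING", value=variable_x
            )
        ]
    )
    return client.query(query=query, job_config=job_config).to_dataframe()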

Possible overhead on dask computation over list of delayed objects

I have a ddf with lots of partitions
import dask.dataframe as dd

ddf = dd.read_parquet("./input-*", engine='fastparquet')
ddf
Dask DataFrame Structure:
datetime ndvi str utm_x utm_y fpath scl_value
npartitions=71
Dask Name: read-parquet, 71 tasks
In each partition I want to run a custom function
my_df_list = list()
for arg_key, arg_value in my_dict_of_args.items():
    ddf_item = ddf_sliced.map_partitions(myfunc,
                                         my_arg1=arg_key,
                                         my_arg2=arg_value,
                                         meta=my_meta)
    my_df_list.append(ddf_item)
Things start to get tricky here: I have found that the following command is too much for my PC, taking forever to begin the first item's computation and eventually depleting all my RAM:
dask.compute(*my_df_list)
Example graph using 2 dfs instead of 71, dask.visualize(*my_df_list):
But it can easily handle the computation of each element, one by one:
my_df_list[0].compute()
...
my_df_list[71].compute()
Example graph using 2 dfs instead of 71, my_df_list[0].visualize():
I'm struggling to understand the difference, since to me it's the same iteration scheme.
If it is indeed overhead, I would be glad to hear about alternative flows that avoid calling .compute on each element manually.
EDIT 1
After posting the graph images, I understand that dask.compute(*list) boosts parallelism by optimizing the dataframe reads; see the documentation section Avoid calling compute repeatedly.
Now I can see the real problem is the initialization of the graph, and probably my code: even when loading 2 dfs instead of 71, my memory is depleted far before the real computation starts when using dask.compute(*list).
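For illustration, a rough sketch of the two calling patterns discussed above, plus a batched middle ground; the batch size of 10 is purely an assumption and only limits how much of the graph is scheduled at once:

import dask

# One big graph: reads are shared between items, but the whole graph is
# built and scheduled at once, which is where the memory blow-up appears.
all_results = dask.compute(*my_df_list)

# One graph per item: no shared reads, but each graph stays small.
one_by_one = [item.compute() for item in my_df_list]

# Assumed middle ground: compute in batches, sharing work within each batch
# without ever holding the full 71-item graph at once.
batch_size = 10
batched = []
for i in range(0, len(my_df_list), batch_size):
    batched.extend(dask.compute(*my_df_list[i:i + batch_size]))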

How to: Pyspark dataframe persist usage and reading-back

I'm quite new to pyspark, and I'm getting the following error: Py4JJavaError: An error occurred while calling o517.showString., which I've read is due to a lack of memory: Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
I've read that a workaround for this situation is to use df.persist() and then read the persisted df back, so I would like to know:
Given a for loop in which I do some .join operations, should I use .persist() inside the loop or at the end of it? e.g.
for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer').persist()

--> or <--

for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')
df_AA.persist()
Once I've done that, how should I read the dataframe back?
df_AA.unpersist()? sqlContext.read.some_thing(df_AA)?
I'm really new to this, so please, try to explain as best as you can.
I'm running on a local machine (8GB RAM), using jupyter-notebooks (Anaconda); Windows 7; Java 8; Python 3.7.1; pyspark v2.4.3
Spark is a lazily evaluated framework, so none of the transformations (e.g. join) are executed until you call an action.
So go ahead with what you have done:
from pyspark import StorageLevel

for col in columns:
    df_AA = df_AA.join(df_B, df_AA[col] == 'some_value', 'outer')
df_AA.persist(StorageLevel.MEMORY_AND_DISK)
df_AA.show()
There are multiple persist options available; choosing MEMORY_AND_DISK will spill data that cannot be held in memory to disk.
GC errors can also be the result of insufficient driver memory provided for the Spark application.
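If driver memory is indeed the limiting factor, a minimal sketch of raising it when the session is created (the 4g value is just an assumed example for an 8GB machine; the setting must be applied before the driver JVM starts, i.e. before any session exists in the notebook):

from pyspark.sql import SparkSession

# Assumed example values; the config only takes effect for a new session.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("join-and-persist")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)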

How to recover from an error in a nested celery chain?

The kind of workflow that I want to run looks like this:
workflow = (
    generator.s() |
    spread.s() |
    gather.s()
)
where spread is a task that replaces itself with a group.
from celery import Celery, group

celery_app = Celery()

@celery_app.task(bind=True)
def spread(self, numbers):
    return self.replace(group(
        (task_1.si(n) | task_2.s() | task_3.s()) for n in numbers
    ))
The whole workflow works fine and as expected.
My question is essentially only about the chains in the group created by spread. I don't care too much if some of them fail. I'm fine if an error somewhere in a chain leads to a shorter list of results being passed to gather. However, I'm not sure how to achieve that.
I can, of course, catch exceptions in each of task_1, task_2, and task_3 and pass on an empty dummy result (see the sketch after the setup list below). For convenience, I'd really like to be able to say: on an error anywhere in the chain, please log the traceback and either remove the result from the group or pass on an empty dummy result.
I've searched the documentation and GitHub issues far and wide but could not find anything. I know that I can pass an on_error callback to the chain but I don't know how to pass on an empty result from there (if that's even possible).
Setup:
Python 3.6
celery 4.2.1
Redis broker and backend (though it's not a problem for me to switch if that would enable the behavior)
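For what it's worth, a minimal sketch of the catch-and-pass-a-dummy-result fallback mentioned in the question; do_real_work, the None placeholder, and the filtering in gather are all assumptions, not part of the original workflow:

import logging

from celery import Celery

celery_app = Celery()
logger = logging.getLogger(__name__)


@celery_app.task
def task_2(payload):
    try:
        return do_real_work(payload)  # hypothetical real implementation
    except Exception:
        # Log the traceback and pass an empty dummy result down the chain,
        # so the group still delivers one result per chain to gather().
        logger.exception("task_2 failed, passing a dummy result onward")
        return None


@celery_app.task
def gather(results):
    # Drop the dummy placeholders so downstream code sees a shorter list.
    return [r for r in results if r is not None]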

How to change SparkContext property spark.sql.pivotMaxValues in jupyter PySpark session

I made the following code change to increase spark.sql.pivotMaxValues. Sadly, it had no effect on the resulting error after restarting Jupyter and running the code again.
from pyspark import SparkConf, SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
import numpy as np

try:
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker')  # original
    #conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", "99999")
    conf = SparkConf().setMaster('local').setAppName('autoencoder_recommender_wide_user_record_maker').set("spark.sql.pivotMaxValues", 99999)
    sc = SparkContext(conf=conf)
except:
    print("Variables sc and conf are now defined. Everything is OK and ready to run.")

<... (other code) ...>

df = sess.read.csv(in_filename, header=False, mode="DROPMALFORMED", schema=csv_schema)
ct = df.crosstab('username', 'itemname')
Spark error message that was thrown on my crosstab line of code:
IllegalArgumentException: "requirement failed: The number of distinct values for itemname, can't exceed 1e4. Currently 16467"
I expect I'm not actually setting the config variable that I was trying to set, so what is a way to actually get that value set, programmatically if possible? Thanks.
References:
Finally, you may be interested to know that there is a maximum number
of values for the pivot column if none are specified. This is mainly
to catch mistakes and avoid OOM situations. The config key is
spark.sql.pivotMaxValues and its default is 10,000.
Source: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
I would prefer to raise the config value, since I have already written the crosstab code, which works great on smaller datasets. If it turns out there truly is no way to change this config variable, then my backup plans are, in order:
relational right outer join to implement my own Spark crosstab with higher capacity than was provided by databricks
scipy dense vectors with handmade unique combinations calculation code using dictionaries
kernel.json
This configuration file is distributed together with Jupyter:
~/.ipython/kernels/pyspark/kernel.json
It contains the Spark configuration, including the variable PYSPARK_SUBMIT_ARGS, the list of arguments that will be used with the spark-submit script.
You can try adding --conf spark.sql.pivotMaxValues=99999 to this variable in the file mentioned above.
PS
There are also cases where people try to override this variable programmatically. You can give that a try too...
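As a hedged sketch of that programmatic route (it assumes a fresh notebook kernel, since options set on the builder only take effect for a newly created session):

from pyspark.sql import SparkSession

# Assumes no SparkContext/SparkSession is already running in this kernel.
spark = (
    SparkSession.builder
    .master("local")
    .appName("autoencoder_recommender_wide_user_record_maker")
    .config("spark.sql.pivotMaxValues", "99999")
    .getOrCreate()
)

# spark.sql.* options can also be changed on a live session:
spark.conf.set("spark.sql.pivotMaxValues", "99999")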
