How to prevent memory leak when testing with HiveContext in PySpark - python

I use pyspark to do some data processing and leverage HiveContext for the window function.
In order to test the code, I use TestHiveContext, basically copying the implementation from pyspark source code:
https://spark.apache.org/docs/preview/api/python/_modules/pyspark/sql/context.html
#classmethod
def _createForTesting(cls, sparkContext):
"""(Internal use only) Create a new HiveContext for testing.
All test code that touches HiveContext *must* go through this method. Otherwise,
you may end up launching multiple derby instances and encounter with incredibly
confusing error messages.
"""
jsc = sparkContext._jsc.sc()
jtestHive = sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
return cls(sparkContext, jtestHive)
My tests then inherit the base class which can access the context.
This worked fine for a while. However, I started noticing some intermittent process running out of memory issues as I added more tests. Now I can't run the test suite without a failure.
"java.lang.OutOfMemoryError: Java heap space"
I explicitly stop the spark context after each test is run, but that does not appear to kill the HiveContext. Thus, I believe it keeps creating new HiveContexts everytime a new test is run and doesn't remove the old one which results in the memory leak.
Any suggestions for how to teardown the base class such that it kills the HiveContext?

If you're happy to use a singleton to hold the Spark/Hive context in all your tests, you can do something like the following.
test_contexts.py:
_test_spark = None
_test_hive = None
def get_test_spark():
if _test_spark is None:
# Create spark context for tests.
# Not really sure what's involved here for Python.
_test_spark = ...
return _test_spark
def get_test_hive():
if _test_hive is None:
sc = get_test_spark()
jsc = test_spark._jsc.sc()
_test_hive = sc._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc)
return _test_hive
And then you just import these functions in your tests.
my_test.py:
from test_contexts import get_test_spark, get_test_hive
def test_some_spark_thing():
sc = get_test_spark()
sqlContext = get_test_hive()
# etc

Related

keep my data test when running TestCase in django

i have a initial_data command in my project and when i try to create my test database and run this command
"py manage.py initial_data" it's not working because this command makes data in main database and not test_database, actually can't understand which one is test_database.
so i give up to use this command,and i have other method to resolve my problem .
and i decided to create manually my initial_data.
initial_data command actually makes my required data in my other tables.
and why i try to create data?
because i have greater than 500 tests and this tests required to other test datas.
so i have this tests:
class BankServiceTest(TestCase):
def setUp(self):
self.bank_model = models.Bank
self.country_model = models.Country
def test_create_bank_object(self):
self.bank_model.objects.create(name='bank')
re1 = self.bank_model.objects.get(name='bank')
self.assertEqual(re1.name, 'bank')
def test_create_country_object(self):
self.country_model.objects.create(name='country', code=100)
bank = self.bank_model.objects.get(name='bank')
re1 = self.country_model.objects.get(code=100)
self.assertEqual(re1.name, 'country')
i want after running "test create bank object " to have first function datas in second function
actually this method Performs the command "initial_data".
so what is the method for this problem.
i used unit_test.TestCase but it not working for me.
this method actually makes test data in main data base and i do not want this method.
or maybe i used bad of this method.
actually i need to keep my data in running test.

Python: Multiprocessing error on pandas data frame: Clients have non-trivial state that is local and unpickleable

I have a dataframe that I am dividing into multiple dataframes using groupby. Now I want to process each of these dataframes for which I have written a function process_s2id in parallel. I have the whole code in a class which I am executing using a main function in another file. But I am getting the following error:
"Clients have non-trivial state that is local and unpickleable.",
_pickle.PicklingError: Pickling client objects is explicitly not supported.
Clients have non-trivial state that is local and unpickleable.
Following is the code (we execute the main() function in this class):
import logging
import pandas as pd
from functools import partial
from multiprocessing import Pool, cpu_count
class TestClass:
def __init__(self):
logging.basicConfig(level=logging.INFO)
self.logger = logging.getLogger()
def process_s2id(self, df, col, new_col):
dim2 = ['s2id', 'date', 'hours']
df_hour = df.groupby(dim2)[[col, 'orders']].sum().reset_index()
df_hour[new_col] = df_hour[col] / df_hour['orders']
df_hour = df_hour[dim2 + [new_col]]
return df_hour
def run_parallel(self, df):
series = [frame for keys, frame in df.groupby('s2id')]
p = Pool(cpu_count())
prod_x = partial(
self.process_s2id,
col ="total_supply",
new_col = "supply"
)
s2id_supply_list = p.map(prod_x, series)
p.close()
p.join()
s2id_supply = pd.concat(s2id_supply_list, axis=0)
return ms2id_bsl
def main(self):
data = pd.read_csv("data/interim/fs.csv")
out = self.run_parallel(data)
return out
I tried running this code in Spyder and it works fine. But when I am executing it from another file. I am getting an error. Following are the execution file code and error:
import TestClass
def main():
tc = TestClass()
data = tc.main()
if __name__ == '__main__':
main()
When I looked into the error traceback, I found that the error is occuring on the line s2id_supply_list = p.map(prod_x, series) where the function is starting to go parallel. I also tried running this in series and it worked. Also, I noticed that this particular error is coming from client.py from Google cloud package. There is a certain code in which I am uploading the data to Google cloud but that should be invariant to this code. I tried searching hard for this error but all the results are linked to Google cloud package related issues and not the multiprocessing package.
Can anyone help me out in understanding this error and how can I fix it?
Other information:
I have the following versions of packages:
python==3.7.7
pandas==1.0.5
google-cloud-storage==1.20.0
google-cloud-core==1.0.3
I am running this on macbook pro.
I figured out. When we are using Pool over a function to run it parallel, it expects the first argument to be the iterator. In other words, the function will run parallelly over different values of the first argument. When we have a non-static function in a class, we have the first argument as self or the class itself. But the stupid Pool function doesn't know how to iterate with the self because it's the wrong argument. The right argument is the second one.
We can solve this by either:
Taking the function out of the class and kicking the self out of the arguments.
Adding #staticmethod on top of the function and kicking the self out of the arguments.
I hope this helps someone who is struggling with a similar problem.

CANoe: How to select and start test cases from XML Test Module from Python using CANoe COM interface?

currently I am able to:
start CANoe application
load a CANoe configuration file
load a test setup file
def load_test_setup(self, canoe_test_setup_file: str = None) -> None:
logger.info(
f'Loading CANoe test setup file <{canoe_test_setup_file}>.')
if self.measurement.Running:
logger.info(
f'Simulation is currently running, so new test setup could \
not be loaded!')
return
self.test_setup.TestEnvironments.Add(canoe_test_setup_file)
test_environment = self.test_setup.TestEnvironments.Item(1)
logger.info(f'Loaded test environment is <{test_environment.Name}>.')
How can I access the XML Test Module loaded with the test setup (tse) file and select tests to be executed?
The last before line in your snippet is most probably causing the issue.
I have been trying to fix this issue for quite some time now and finally found the solution.
Somehow when you execute the line self.test_setup.TestEnvironments.Item(1)
win32com creates an object of type TestSetupItem which doesn't have the necessary properties or methods to access the test cases. Instead we want to access objects of collection types TestSetupFolders or TestModules. win32com creates object of TestSetupItem type even though I have a single XML Test Module (called AutomationTestSeq) in the Test Environment as you can see here.
There are three possible solutions that I found.
Manually clearing the generated cache before each run.
Using win32com.client.DispatchWithEvents or win32com.client.gencache.EnsureDispatch generates a bunch of python files that describe CANoe's object model.
If you had used either of those before, TestEnvironments.Item(1) will always return TestSetupItem instead of the more appropriate type objects.
To remove the cache you need to delete the C:\Users\{username}\AppData\Local\Temp\gen_py\{python version} folder.
Doing this every time is of course not very practical.
Force win32com to always use dynamic dispatch.
You can do this by using:
canoe = win32com.client.dynamic.Dispatch("CANoe.Application")
Any objects you create using canoe from now on, will be dynamically dispatched.
Forcing dynamic dispatch is easier than manually clearing the cache folder every time. This gave me good results always. But doing this will not let you have any insight into the objects. You won't be able to see the acceptable properties and methods for the objects.
Typecast TestSetupItem to TestSetupFolders or TestModules.
This has the risk that if you typecast incorrectly, you will get unexpected results. But has worked well for me so far.
In short: win32.CastTo(test_env, "ITestEnvironment2"). This will ensure that you are using the recommended object hierarchy as per CANoe technical reference.
Note that you will also have to typecast TestSequenceItem to TestCase to be able to access test case verdict and enable/disable test cases.
Below is a decent example script.
"""Execute XML Test Cases without a pass verdict"""
import sys
from time import sleep
import win32com.client as win32
CANoe = win32.DispatchWithEvents("CANoe.Application")
CANoe.Open("canoe.cfg")
test_env = CANoe.Configuration.TestSetup.TestEnvironments.Item('Test Environment')
# Cast required since test_env is originally of type <ITestEnvironment>
test_env = win32.CastTo(test_env, "ITestEnvironment2")
# Get the XML TestModule (type <TSTestModule>) in the test setup
test_module = test_env.TestModules.Item('AutomationTestSeq')
# {.Sequence} property returns a collection of <TestCases> or <TestGroup>
# or <TestSequenceItem> which is more generic
seq = test_module.Sequence
for i in range(1, seq.Count+1):
# Cast from <ITestSequenceItem> to <ITestCase> to access {.Verdict}
# and the {.Enabled} property
tc = win32.CastTo(seq.Item(i), "ITestCase")
if tc.Verdict != 1: # Verdict 1 is pass
tc.Enabled = True
print(f"Enabling Test Case {tc.Ident} with verdict {tc.Verdict}")
else:
tc.Enabled = False
print(f"Disabling Test Case {tc.Ident} since it has already passed")
CANoe.Measurement.Start()
sleep(5) # Sleep because measurement start is not instantaneous
test_module.Start()
sleep(1)
Just continue what you have done.
The TestEnvironment contains the TestModules. Each TestModule contains a TestSequence which in turn contains the TestCases.
Keep in mind that you cannot individual TestCases but only the TestModule. But you can enable and disable individual TestCases before execution by using the COM-API.
(typing this from the top of my head, might not work 100%)
test_module = test_environment.TestModules.Item(1) # of 2 or whatever
test_sequence = test_module.Sequence
for i in range(1, test_sequence.Count + 1):
test_case = test_sequence.Item(i)
if ...:
test_case.Enabled = False # or True
test_module.Start()
You have to keep in mind that a TestSequence can also contain other TestSequences (i.e. a TestGroup). This depends on how your TestModule is setup. If so, you have to take care of that in your loop and descend into these TestGroups while searching for your TestCase of interest.

Heavy stateful UDF in pyspark

I have to run a really heavy python function as UDF in Spark and I want to cache some data inside UDF. The case is similar to one mentioned here
I am aware that it is slow and wrong.
But the existing infrastructure is in spark and I don't want set up a new infrastructure and deal with data loading/parallelization/fail safety separately for this case.
This is how my spark program looks like:
from mymodule import my_function # here is my function
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
schema = StructType().add("input", "string")
df = spark.read.format("json").schema(schema).load("s3://input_path")
udf1 = udf(my_function, StructType().add("output", "string"))
df.withColumn("result", udf1(df.input)).write.json("s3://output_path/")
The my_function internally calls a method of an object with a slow constructor.
Therefore I don't want the object to be initialized for every entry and I am trying to cache it:
from my_slow_class import SlowClass
from cachetools import cached
#cached(cache={})
def get_cached_object():
# this call is really slow therefore I am trying
# to cache it with cachetools
return SlowClass()
def my_function(input):
slow_object = get_cached_object()
output = slow_object.call(input)
return {'output': output}
mymodule and my_slow_class are installed as modules on each spark machine.
It seems working. The constructor is called only a few times (only 10-20 times for 100k lines in input dataframe). And that is what I want.
My concern is multithreading/multiprocessing inside Spark executors and if the cached SlowObject instance is shared between many parallel my_function calls.
Can I rely on the fact that my_function is called once at a time inside python processes on worker nodes? Does spark use any multiprocessing/multithreading in python process that executes my UDF?
Spark forks Python process to create individual workers, however all processing in the individual worker process is sequential, unless multithreading or multiprocessing is used explicitly by the UserDefinedFunction.
So as long as state is used for caching and slow_object.call is a pure function you have nothing to worry about.

How to run a function on all Spark workers before processing data in PySpark?

I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
if hadoop.version > local.version:
hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this shitty work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what actually loading models means currently:
def on_partition(config, partition):
if not MyClassifier.is_loaded():
MyClassifier.load_models(config)
handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
clf = None
#staticmethod
def is_loaded():
return MyClassifier.clf is not None
#staticmethod
def load_models(config):
MyClassifier.clf = load_from_file(config)
Static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant with relation to another worker). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will be set. This is something that should really not be done for each batch, it's a one time thing. Same with downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines the simplest approach is to use SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles
with open(SparkFiles.get(some_path)) as fw:
... # Do something
If you want to make sure that model is actually loaded the simplest approach is to load on module import. Assuming config can be used to retrieve model path:
model.py:
from pyspark import SparkFiles
config = ...
class MyClassifier:
clf = None
#staticmethod
def is_loaded():
return MyClassifier.clf is not None
#staticmethod
def load_models(config):
path = SparkFiles.get(config.get("model_file"))
MyClassifier.clf = load_from_file(path)
# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Let's say fetch_models returns the models rather than saving them locally, you would do something like:
bc_models = sc.broadcast(fetch_models())
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
.repartition(spark_partitions)\
.foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that can outperform distributing through HDFS significantly according to this analysis

Categories

Resources