In my constructor I initialize an empty dictionary, and then in a UDF I update it with the new data that arrived in the batch.
My problem is that in every new batch the dictionary is empty again.
How can I bypass this reset, so that new batches have access to all the values I have already added to my dictionary?
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import CharacteristicVector
import update_charecteristic_vector

class SomeClass(object):
    def __init__(self):
        self.grid_list = {}

    def run_stream(self):
        def update_grid_list(grid):
            if grid not in self.grid_list:
                self.grid_list[grid] = CharacteristicVector()
            self.grid_list[grid] = update_charecteristic_vector(self.grid_list[grid])
            return self.grid_list[grid].Density
        # ...
        udf_update_grid_list = udf(update_grid_list, StringType())
        grids_dataframe = hashed.select(
            hashed.grid.alias('grid'),
            udf_update_grid_list(hashed.grid).alias('Density')
        )
        query = grids_dataframe.writeStream.format("console").start()
        query.awaitTermination()
Unfortunately, this code cannot work for multiple reasons. Even with a single batch, or in a batch application, it would work only if there is a single active Python worker process. Moreover, it is not possible in general to have globally synchronized state with support for both reads and writes.
You should be able to use stateful transformations, but for now these are supported only in Java / Scala and the interface is still experimental / evolving.
Depending on your requirements, you can try to use an in-memory data grid, a key-value store, or a distributed cache.
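For illustration only, here is a minimal sketch of the key-value-store route, assuming a local Redis instance and the redis-py package; the per-grid counter update is just a stand-in for the question's CharacteristicVector / update_charecteristic_vector logic:

import json

import redis
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

_redis = None  # one lazily created connection per Python worker process

def get_redis():
    global _redis
    if _redis is None:
        _redis = redis.Redis(host='localhost', port=6379)
    return _redis

def update_grid_list(grid):
    r = get_redis()
    raw = r.get(grid)
    state = json.loads(raw) if raw else {'density': 0}
    state['density'] += 1  # stand-in for update_charecteristic_vector
    r.set(grid, json.dumps(state))
    return str(state['density'])

udf_update_grid_list = udf(update_grid_list, StringType())

Note that this read-modify-write cycle is not atomic; if several workers can update the same grid concurrently you would need a Redis transaction or a server-side increment instead.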
For this question, consider I have a repository with one asset:
@asset
def my_int():
    return 1

@repository
def my_repo():
    return [my_int]
I want to execute it in process (with mem_io_manager), but I would like to retrieve the value returned by my_int from memory later. I can do that with fs_io_manager, for example, using my_repo.load_asset_value('my_int') after it has run. But the same method with mem_io_manager raises dagster._core.errors.DagsterInvariantViolationError: Attempting to access step_key, but it was not provided when constructing the OutputContext.
Ideally, I would execute it in process and tell the executor to return me one (or more) of the assets, something like:
my_assets = my_repo.get_job('__ASSET_JOB').execute_in_process(return_assets=[my_int, ...])
mem_io_manager doesn't store objects to file storage the way fs_io_manager does. In your my_int asset you could either:
save the value to a file or some other (cloud) storage and retrieve it later, or
add the value as metadata, if it is a simple integer or string, and retrieve that later.
For the second case, using metadata, you can do:
from dagster import Output, asset

@asset
def my_int(context):
    my_int_value = 1  # the value the asset computes (1 in the question's example)
    return Output(my_int_value, metadata={'my_int_value': my_int_value})
and to retrieve it later, in another asset, you could do:
@asset
def retrieve_my_int(context):
    asset_key = 'my_int'
    latest_materialization_event = (
        context.instance.get_latest_materialization_events(
            [asset_key]
        ).get(asset_key)
    )
    if latest_materialization_event:
        materialization = (
            latest_materialization_event.dagster_event.event_specific_data.materialization
        )
        metadata = {
            entry.label: entry.entry_data
            for entry in materialization.metadata_entries
        }
        retrieved_int = metadata['my_int_value'].value if 'my_int_value' in metadata else None
        # ...
The metadata approach has limitations, as you can only store certain kinds of data. If you want to store any kind of data, you'd have to execute the jobs differently so that the results can be materialized to a file system or an io_manager of your choice.
Instead of execute_in_process, you'd have to use materialize.
from dagster import asset, materialize, load_assets_from_current_module

@asset
def my_int(context):
    ...

@asset
def asset_other(context):
    ...

if __name__ == '__main__':
    asset_results = materialize(
        load_assets_from_current_module()
    )
This will materialize the assets, and you can specify which io_manager to use via the resources parameter. To retrieve an asset value, you can do
my_int_value = asset_results.output_for_node('my_int')
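Putting it together, a minimal sketch, assuming dagster's built-in fs_io_manager and the default 'io_manager' resource key (swap in whichever io_manager you prefer):

from dagster import fs_io_manager, materialize, load_assets_from_current_module

asset_results = materialize(
    load_assets_from_current_module(),
    resources={'io_manager': fs_io_manager},  # the IO manager the assets materialize through
)
my_int_value = asset_results.output_for_node('my_int')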
Your question is a bit unclear - do you want to materialize the asset in process and then, at a later time / in a different process, access the result? Or do you just want to execute in process and get the result back?
In the former case, @Kay is correct that the result will disappear after the process completes, as with mem_io_manager the memory the result is stored in is tied to the lifecycle of the process.
In the latter case, you should be able to do something like
from dagster import materialize
asset_result = materialize([my_int])
Both answers given so far would require materialization (to disk), which is not what I wanted in the first place; I wanted to retrieve the value from memory.
But @kay pointed me in the right direction: output_for_node works on the execute_in_process result, so the following code achieves what I wanted (retrieving my_int's result from memory after the job's execution).
from dagster import asset, repository

@asset
def my_int():
    return 1

@repository
def my_repo():
    return [my_int]

my_assets = my_repo.get_job('__ASSET_JOB').execute_in_process()
my_assets.output_for_node("my_int")
import ray
import numpy as np

ray.init()

@ray.remote
def f():
    return np.zeros(10000000)

results = []
for i in range(100):
    print(i)
    results += ray.get([f.remote() for _ in range(50)])
Normally, when the object store fills up, it begins evicting objects that are not in use (in a least-recently used fashion). However, because all of the objects are numpy arrays that are being held in the results list, they are all still in use, and the memory that those numpy arrays live in is actually in the object store, so they are taking up space in the object store. The object store can't evict them until those objects go out of scope.
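To make that concrete, here is a sketch of one way to avoid pinning the arrays: reduce each batch to something small before accumulating it, so the large buffers go out of scope and become evictable (the per-array sum is only a placeholder for whatever reduction the real workload needs):

import ray
import numpy as np

ray.init()

@ray.remote
def f():
    return np.zeros(10000000)

totals = []
for i in range(100):
    batch = ray.get([f.remote() for _ in range(50)])
    # Keep only a small summary; the big arrays are no longer referenced after
    # this iteration, so the object store is free to evict them.
    totals.append(sum(arr.sum() for arr in batch))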
Question: How can I specify an external object store like Redis without exceeding memory on a single machine? I don't want to use /dev/shm or /tmp as the object store, as only limited memory is available and it quickly fills up.
As of Ray 1.2.0, object spilling for out-of-core data processing is supported. From 1.3+ (which will be released in 3 weeks), this feature will be turned on by default.
https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html
But your example won't work with this feature. Let me explain why here.
There are two things you need to know.
1) When you call a Ray task (f.remote) or ray.put, it returns an object reference. Try
ref = f.remote()
print(ref)
2) When you run ray.get on this reference, the Python variable now accesses that memory directly (in Ray the object lives in shared memory, which is managed by Ray's distributed object store, the plasma store, if the object size is >= 100KB). So,
obj = ray.get(ref)  # Now, obj points to the shared memory directly.
Currently, the object spilling feature supports disk spilling for case 1), but not for case 2) (case 2 is much trickier to support, as you can imagine).
So there are two solutions here:
1) Use a file directory for your plasma store. For example, start Ray with
ray.init(_plasma_directory="/tmp")
This will allow you to use the /tmp folder as the plasma store (meaning Ray objects are stored in the tmp file system). Note that you may see performance degradation when you use this option.
2) Use object spilling with backpressure. Instead of getting all of your Ray objects with ray.get, use ray.wait.
import json

import ray
import numpy as np

# Note: You don't need to specify this if you use the latest master.
ray.init(
    _system_config={
        "automatic_object_spilling_enabled": True,
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/spill"}},
        )
    },
)

@ray.remote
def f():
    return np.zeros(10000000)

result_refs = []
for i in range(100):
    print(i)
    result_refs += [f.remote() for _ in range(50)]

while result_refs:
    [ready], result_refs = ray.wait(result_refs)
    result = ray.get(ready)
I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to local disk, like this pseudo code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be used, like this ugly work-around I'm doing now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function is run on ALL workers I set the range to 1000. I also don't exactly know how many workers are in the cluster when running.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task is as usual, operating on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not an issue on how to distribute the files or how to load them, it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what loading the models currently means:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
I use static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant to other workers). This way we only have to call load_models() once, and on all future batches MyClassifier.clf will already be set. This is something that should not be done for each batch; it's a one-time thing. The same goes for downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines, the simplest approach is the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles

with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Assuming fetch_models returns the models rather than saving them locally, you would do something like:
bc_models = sc.broadcast(fetch_models())

spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory, on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but it uses 'efficient broadcast algorithms' that, according to this analysis, can significantly outperform distribution through HDFS.
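For completeness, a sketch of how spam.on_partition (the asker's own function) might accept the broadcast value; models.predict and handle_result are placeholders for the real classification and downstream handling:

def on_partition(config, partition, models):
    # `models` is bc_models.value, already deserialized once per executor.
    for message in partition:
        label = models.predict(message)        # placeholder for the real classifier call
        handle_result(config, message, label)  # placeholder downstream handling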
My machine learning script produces a lot of data (millions of BTrees contained in one root BTree) and stores it in ZODB's FileStorage, mainly because all of it wouldn't fit in RAM. The script also frequently modifies previously added data.
When I increased the complexity of the problem, and thus more data needs to be stored, I noticed performance issues - the script now computes the data on average two to ten times slower (the only thing that changed is the amount of data to be stored and later retrieved to be changed).
I tried setting cache_size to various values between 1000 and 50000. To be honest, the differences in speed were negligible.
I thought of switching to RelStorage, but unfortunately the docs only mention how to configure frameworks such as Zope or Plone, and I'm using ZODB only.
I wonder if RelStorage would be faster in my case.
Here's how I currently setup ZODB connection:
import ZODB
connection = ZODB.connection('zodb.fs', ...)
dbroot = connection.root()
It's clear to me that ZODB is currently the bottleneck of my script.
I'm looking for advice on how I could solve this problem.
I chose ZODB because I thought that a NoSQL database would better fit my case and I liked the idea of an interface similar to Python's dict.
Code and data structures:
root data structures:
if not hasattr(dbroot, 'actions_values'):
    dbroot.actions_values = BTree()
if not hasattr(dbroot, 'games_played'):
    dbroot.games_played = 0
actions_values is conceptually built as follows:
actions_values = { # BTree
    str(state): { # BTree
        # contains actions (the column to pick, to be exact, as I'm working on an agent playing Connect 4)
        # and their values (only actions previously taken by the agent are present here), e.g.:
        1: 0.4356,
        5: 0.3456,
    },
    # other states
}
state is a simple 2D array representing the game board. The possible values of its fields are 1, 2 or None:
board = [ [ None ] * cols for _ in xrange(rows) ]
(in my case rows = 6 and cols = 7)
main loop:
should_play = 10000000
transactions_freq = 10000
packing_freq = 50000

player = ReinforcementPlayer(dbroot.actions_values, config)

while dbroot.games_played < should_play:
    # max_epsilon at start and then linearly drops to min_epsilon:
    epsilon = max_epsilon - (max_epsilon - min_epsilon) * dbroot.games_played / (should_play - 1)

    dbroot.games_played += 1
    sys.stdout.write('\rPlaying game %d of %d' % (dbroot.games_played, should_play))
    sys.stdout.flush()

    board_state = player.play_game(epsilon)

    if dbroot.games_played % transactions_freq == 0:
        print('Committing...')
        transaction.commit()
    if dbroot.games_played % packing_freq == 0:
        print('Packing DB...')
        connection.db().pack()
(Packing also takes much time, but it's not the main problem; I could pack the database after the program finishes.)
Code operating on dbroot (inside ReinforcementPlayer):
def get_actions_with_values(self, player_id, state):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        return self.actions_values[lookup_state_str]
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        return self.mirror_actions(self.actions_values[mirror_lookup_state_str])
    return None

def get_value_of_action(self, player_id, state, action, default=0):
    actions = self.get_actions_with_values(player_id, state)
    if actions is None:
        return default
    return actions.get(action, default)

def set_value_of_action(self, player_id, state, action, value):
    if player_id == 1:
        lookup_state = state
    else:
        lookup_state = state.switch_players()
    lookup_state_str = str(lookup_state)
    if lookup_state_str in self.actions_values:
        self.actions_values[lookup_state_str][action] = value
        return
    mirror_lookup_state_str = str(lookup_state.mirror())
    if mirror_lookup_state_str in self.actions_values:
        self.actions_values[mirror_lookup_state_str][self.mirror_action(action)] = value
        return
    self.actions_values[lookup_state_str] = BTree()
    self.actions_values[lookup_state_str][action] = value
(Functions with mirror in the name simply reverse the columns (actions). This is done because Connect 4 boards that are vertical reflections of each other are equivalent.)
After 550000 games len(dbroot.actions_values) is 6018450.
According to iotop IO operations take 90% of the time.
Using any (other) database would probably not help, as they are subject to the same disk IO and memory limitations as ZODB. If you manage to offload computations to the database engine itself (PostgreSQL + using SQL scripts) it might help, as the database engine would have more information to make intelligent choices about how to execute the code, but there is nothing magical here and the same things can most likely be done with ZODB with relative ease.
Some ideas of what can be done:
Have indexes of the data instead of loading full objects (loading everything is the equivalent of an SQL "full table scan"). Keep intelligently preprocessed copies of the data: indexes, sums, partials.
Make the objects themselves smaller (Python classes have the __slots__ trick; see the sketch below)
Use transactions in an intelligent fashion. Don't try to process all the data in a single big chunk.
Parallel processing - use all CPU cores instead of a single-threaded approach
Don't use BTrees - maybe there is something more efficient for your use case
Having some code samples of your script, actual RAM and Data.fs sizes, etc. would help here to give further ideas.
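As a minimal sketch of the __slots__ point mentioned above (the class and its fields are hypothetical stand-ins for whatever per-action objects the script stores):

class ActionValue(object):
    # Without __slots__ every instance carries a per-instance __dict__;
    # declaring the fixed attribute set removes it and shrinks each object.
    __slots__ = ('action', 'value')

    def __init__(self, action, value):
        self.action = action
        self.value = value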
Just to be clear here, which BTree class are you actually using? An OOBTree?
A few aspects about those BTrees:
1) Each BTree is composed of a number of Buckets. Each Bucket will hold a certain number of items before being split. I can't remember how many items they currently hold, but I did once try tweaking the C code and recompiling to hold a larger number, as the value was chosen nearly two decades ago.
2) It is sometimes possible to construct very unbalanced BTrees. For example, if you add values in sorted order (e.g. a timestamp that only ever increases) then you will end up with a tree that is O(n) to search. There is a script written by the folks at Jarn a number of years ago that can rebalance the BTrees in Zope's Catalog, which might be adaptable for your case.
3) Rather than using an OOBTree you can use an OOBucket instead. This ends up being a single pickle in the ZODB, so it may grow too big for your use case, but if you are doing all the writes in a single transaction it may be faster (at the expense of having to re-write the entire Bucket on every update); a minimal sketch follows below.
-Matt
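For illustration only, a sketch of point 3: swapping the inner per-state container from an OOBTree to an OOBucket (assuming the BTrees package that ships with ZODB; actions_values and the state string come from the question):

from BTrees.OOBTree import OOBTree, OOBucket

actions_values = OOBTree()  # outer mapping: state string -> per-state actions

def set_action_value(actions_values, state_str, action, value):
    # Each per-state mapping is small, so a single-pickle OOBucket can replace
    # the inner OOBTree; the whole bucket is re-written whenever it changes.
    if state_str not in actions_values:
        actions_values[state_str] = OOBucket()
    actions_values[state_str][action] = value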
I am currently experimenting with actor concurrency (in Python) because I want to learn more about it. I therefore chose pykka, but when I test it, it is more than twice as slow as a normal function.
The code is only meant to check whether it works; it's not meant to be elegant. :)
Maybe I did something wrong?
from pykka.actor import ThreadingActor
import numpy as np

class Adder(ThreadingActor):
    def add_one(self, i):
        l = []
        for j in i:
            l.append(j + 1)
        return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    adder = Adder.start().proxy()
    adder.add_one(data)
    adder.stop()
This does not run very fast:
time python actor.py
real 0m8.319s
user 0m8.185s
sys 0m0.140s
And now the dummy 'normal' function:
import numpy as np

def foo(i):
    l = []
    for j in i:
        l.append(j + 1)
    return l

if __name__ == '__main__':
    data = np.random.random(1000000)
    foo(data)
Gives this result:
real 0m3.665s
user 0m3.348s
sys 0m0.308s
So what is happening here is that your functional version is creating two very large lists, which accounts for the bulk of the time. When you introduce actors, mutable data like lists must be copied before being sent to the actor in order to maintain proper concurrency. The list created inside the actor must also be copied when it is sent back to the sender. This means that instead of two very large lists being created, four very large lists are created.
Consider designing things so that the data is constructed and maintained by the actor and then queried by calls to the actor, minimizing the size of the messages passed back and forth. Try to apply the principle of minimal data movement. Passing the list in the functional case is only efficient because the data is not actually moving, thanks to the shared memory space. If the actor were on a different machine we would not have the benefit of a shared memory space, even if the message data were immutable and didn't need to be copied.
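As a rough sketch of that design, assuming a recent pykka where ThreadingActor is exported at the package top level; the incremented-sum query is just a stand-in for whatever small result the caller actually needs:

import numpy as np
from pykka import ThreadingActor

class DataHolder(ThreadingActor):
    def __init__(self):
        super().__init__()
        self.data = None

    def load(self, size):
        # The large array is built and kept inside the actor.
        self.data = np.random.random(size)

    def incremented_sum(self):
        # Only a single float crosses the actor boundary.
        return float((self.data + 1).sum())

if __name__ == '__main__':
    holder = DataHolder.start().proxy()
    holder.load(1000000).get()             # proxy calls return futures; .get() waits
    print(holder.incremented_sum().get())
    holder.stop()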