Python: Asynchronous Cassandra Inserts

Python: Asynchronous Cassandra Inserts - python

The problem with the Cassandra Python driver is that the "future" objects returned add a callback via side effect. Meaning the "future" is not composable in the same sense that a Future from either Javascript or Scala is composable. I am wondering if there is a pattern that can be used to bring about transforming a non-composable future into a composable future (preferably without leak memory.)
my_query_object.insert(1, 2, 3, 'Fred Flinstone')
.insert(1, 2, 3, 'Barney Rubble')
.insert(5000, 2, 3, 'George Jetson')
.insert(5000, 2, 3, 'Jane his wife')
Looking at the performance section of the Cassandra Python driver from Datastax, I see an example of how they're creating a constantly chainable series of insert queries. Namely a slightly more complex version of this pattern:
def insert_next(previous_result=sentinel):
if previous_result is not sentinel:
if isinstance(previous_result, BaseException):
log.error("Error on insert: %r", previous_result)
future = session.execute_async(query)
# NOTE: this callback also handles errors
future.add_callbacks(insert_next, insert_next)
which works great as a toy example. The minute one query is done, another equivalent query is executed again. This scheme allows them to achieve 7k writes/s while the version which does not attempt to "chain" callbacks caps out around 2k writes/s.
I've been trying to get my head around creating some sort of mechanism which allows me to recapture that exact mechanism but to no avail. Anyone come up with something like it?

Took me a bit to think about how to preserve the future in some form or another:
import logging
from Queue import Queue #queue in python 3
from threading import Event #hmm... this needed?
insert_logger = logging.getLogger('async_insert')
insert_logger.setLevel(logging.INFO)
def handle_err(err):
insert_logger.warning('Failed to insert due to %s', err)
#Designed to work in a high write environment. Chained callbacks for best performance and fast fail/stop when error
#encountered. Next insert should re-up the writing. Potential loss of failed write. Some guarantee on order of write
#preservation.
class CappedQueueInserter(object):
def __init__(self, session, max_count=0):
self.__queue = Queue(max_count)
self.__session = session
self.__started = Event()
#property
def started(self):
return self.__started.is_set()
def insert(self, bound_statement):
if not self.started:
self._begin(bound_statement)
else:
self._enqueue(bound_statement)
def _begin(self, bound_statement):
def callback():
try:
bound = self.__queue.get(True) #block until an item is added to the queue
future = self.__session.execute_async(bound)
future.add_callbacks(callback, handle_err)
except:
self.__started.clear()
self.__started.set()
future = self.__session.execute_async(bound_statement)
future.add_callbacks(callback, handle_err)
def _enqueue(self, bound_statement):
self.__queue.put(bound_statement, True)
#Separate insert statement binding from the insertion loop
class InsertEnqueue(object):
def __init__(self, prepared_query, insert, consistency_level=None):
self.__statement = prepared_query
self.__level = consistency_level
self.__sink = insert
def insert(self, *args):
bound = self.bind(*args)
self.__sink.insert(bound)
#property
def consistency_level(self):
return self.__level or self.__statement.consistency_level
#consistency_level.setter
def adjust_level(self, value):
if value:
self.__level = value
def bind(self, *args):
bound = self.__statement.bind(*args)
bound.consistency_level = self.consistency_level
return bound
Combination of a Queue and an Event to trigger things. Assuming that write can happen "eventually" this should work.

Related

Threading, multiprocessing shared memory and asyncio

I am having trouble implementing the following scheme :
class A:
def __init__(self):
self.content = []
self.current_len = 0
def __len__(self):
return self.current_len
def update(self, new_content):
self.content.append(new_content)
self.current_len += 1
class B:
def __init__(self, id):
self.id = id
And I also have these 2 functions that will be called later in the main :
async def do_stuff(first_var, second_var):
""" this function is ideally called from the main in another
process. Also, first_var and second_var are not modified so it
would be nice if they could be given by reference without
having to copy them """
### used for first call
yield None
while len(first_var) < CERTAIN_NUMBER:
time.sleep(10)
while True:
## do stuff
if condition_met:
yield new_second_var ## which is a new instance of B
## continue doing stuff
def do_other_stuff(first_var, second_var):
while True:
queue = multiprocessing.JoinableQueue()
results = multiprocessing.Queue()
### do stuff
first_var.update(results)
The main looks like this at the moment :
first_var = A()
second_var = B()
while True:
async for new_second_var in do_stuff(first_var, second_var):
if new_second_var:
## stop the do_other_stuff that is currently running
## to re-launch it with the updated new_var
do_other_stuff(first_var, new_second_var)
else: ## used for the first call
do_other_stuff(first_var, second_var)
Here are my questions :
Is there a better solution to make this scheme work?
How can I implement the "stopping" part since there is a while True loop that fills first_var by reference?
Will the instance of A (first_var) be passed by reference to do_stuff if first_var doesn't get modified inside it?
Is it even possible to have an asynchronous generator in another process?
Is it even possible at all?
This is using Python 3.6 for the async generators.
I hope this is somewhat clear! Thanks a lot!

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions, but on the other I believe multiprocessing creates a copy of the code which the original instance would not have access to. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all manipulated values in the nested classes.
A simple version (OLD CODE):
function runMultProc():
...
dictReports = {}
listReports = ['reportName1.txt', 'reportName2.txt']
tasks = []
pool = multiprocessing.Pool()
for report in listReports:
if report not in dictReports:
dictReports[today][report] = {}
tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
else:
continue
for task in tasks:
report, currentReportDict = task.get()
dictReports[report] = currentFileDict
function worker(report, currentReportDict):
<Manipulate_reports_dict>
return report, currentReportDict
NEW CODE:
class Transfer():
def __init__(self):
self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]
def processReports(self):
self.pool = multiprocessing.Pool()
self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
self.pool.close()
self.pool.join()
def processWorker(self, report):
# **process and manipulate report, currently no return**
report.name = 'foo'
report.path = '/path/to/report'
class Report():
def init(self):
self.name = ''
self.path = ''
self.errors = {}
self.runTime = ''
self.timeProcessed = ''
self.hashes = {}
self.attempts = 0
I don't think this code does what I need it to do, which is to have it process the list of reports in parallel AND, as processWorker manipulates each report class object, store those results. As I am fairly new to this I was hoping someone could help.
The big difference between the two is that the first one build a dictionary and returned it. The second model shouldn't really be returning anything, I just need for the classes to finish being processed and they should have relevant information within them.
Thanks!

Issue with sharing data between Python processes with multiprocessing

I've seen several posts about this, so I know it is fairly straightforward to do, but I seem to be coming up short. I'm not sure if I need to create a worker pool, or use the Queue class. Basically, I want to be able to create several processes that each act autonomously (which is why they inherit from the Agent superclass).
At random ticks of my main loop I want to update each Agent. I'm using time.sleep with different values in the main loop and the Agent's run loop to simulate different processor speeds.
Here is my Agent superclass:
# Generic class to handle mpc of each agent
class Agent(mpc.Process):
# initialize agent parameters
def __init__(self,):
# init mpc
mpc.Process.__init__(self)
self.exit = mpc.Event()
# an agent's main loop...generally should be overridden
def run(self):
while not self.exit.is_set():
pass
print "You exited!"
# safely shutdown an agent
def shutdown(self):
print "Shutdown initiated"
self.exit.set()
# safely communicate values to this agent
def communicate(self,value):
print value
A specific agent's subclass (simulating an HVAC system):
class HVAC(Agent):
def __init__(self, dt=70, dh=50.0):
super(Agent, self).__init__()
self.exit = mpc.Event()
self.__pref_heating = True
self.__pref_cooling = True
self.__desired_temperature = dt
self.__desired_humidity = dh
self.__meas_temperature = 0
self.__meas_humidity = 0.0
self.__hvac_status = "" # heating, cooling, off
self.start()
def run(self): # handle AC or heater on
while not self.exit.is_set():
ctemp = self.measureTemp()
chum = self.measureHumidity()
if (ctemp < self.__desired_temperature):
self.__hvac_status = 'heating'
self.__meas_temperature += 1
elif (ctemp > self.__desired_temperature):
self.__hvac_status = 'cooling'
self.__meas_temperature += 1
else:
self.__hvac_status = 'off'
print self.__hvac_status, self.__meas_temperature
time.sleep(0.5)
print "HVAC EXITED"
def measureTemp(self):
return self.__meas_temperature
def measureHumidity(self):
return self.__meas_humidity
def communicate(self,updates):
self.__meas_temperature = updates['temp']
self.__meas_humidity = updates['humidity']
print "Measured [%d] [%f]" % (self.__meas_temperature,self.__meas_humidity)
And my main loop:
if __name__ == "__main__":
print "Initializing subsystems"
agents = {}
agents['HVAC'] = HVAC()
# Run simulation
timestep = 0
while timestep < args.timesteps:
print "Timestep %d" % timestep
if timestep % 10 == 0:
curr_temp = random.randrange(68,72)
curr_humidity = random.uniform(40.0,60.0)
agents['HVAC'].communicate({'temp':curr_temp, 'humidity':curr_humidity})
time.sleep(1)
timestep += 1
agents['HVAC'].shutdown()
print "HVAC process state: %d" % agents['HVAC'].is_alive()
So the issue is that, whenever I run agents['HVAC'].communicate(x) within the main loop, I can see the value being passed into the HVAC subclass in its run loop (so it prints the received value correctly). However, the value never is successfully stored.
So typical output looks like this:
Initializing subsystems
Timestep 0
Measured [68] [56.948675]
heating 1
heating 2
Timestep 1
heating 3
heating 4
Timestep 2
heating 5
heating 6
When in reality, as soon as Measured [68] appears, the internal stored value should be updated to output 68 (not heating 1, heating 2, etc.). So effectively, the HVAC's self.__meas_temperature is not being properly updated.
Edit: After a bit of research, I realized that I didn't necessarily understand what is happening behind the scenes. Each subprocess operates with its own virtual chunk of memory and is completely abstracted away from any data being shared this way, so passing the value in isn't going to work. My new issue is that I'm not necessarily sure how to share a global value with multiple processes.
I was looking at the Queue or JoinableQueue packages, but I'm not necessarily sure how to pass a Queue into the type of superclass setup that I have (especially with the mpc.Process.__init__(self) call).
A side concern would be if I can have multiple agents reading values out of the queue without pulling it out of the queue? For instance, if I wanted to share a temperature value with multiple agents, would a Queue work for this?
Pipe v Queue

Here's a suggested solution assuming that you want the following:
a centralized manager / main process which controls lifetimes of the workers
worker processes to do something self-contained and then report results to the manager and other processes
Before I show it though, for the record I want to say that in general unless you are CPU bound multiprocessing is not really the right fit, mainly because of the added complexity, and you'd probably be better of using a different high-level asynchronous framework. Also, you should use python 3, it's so much better!
That said, multiprocessing.Manager, makes this pretty easy to do using multiprocessing. I've done this in python 3 but I don't think anything shouldn't "just work" in python 2, but I haven't checked.
from ctypes import c_bool
from multiprocessing import Manager, Process, Array, Value
from pprint import pprint
from time import sleep, time
class Agent(Process):
def __init__(self, name, shared_dictionary, delay=0.5):
"""My take on your Agent.
Key difference is that I've commonized the run-loop and used
a shared value to signal when to stop, to demonstrate it.
"""
super(Agent, self).__init__()
self.name = name
# This is going to be how we communicate between processes.
self.shared_dictionary = shared_dictionary
# Create a silo for us to use.
shared_dictionary[name] = []
self.should_stop = Value(c_bool, False)
# Primarily for testing purposes, and for simulating
# slower agents.
self.delay = delay
def get_next_results(self):
# In the real world I'd use abc.ABCMeta as the metaclass to do
# this properly.
raise RuntimeError('Subclasses must implement this')
def run(self):
ii = 0
while not self.should_stop.value:
ii += 1
# debugging / monitoring
print('%s %s run loop execution %d' % (
type(self).__name__, self.name, ii))
next_results = self.get_next_results()
# Add the results, along with a timestamp.
self.shared_dictionary[self.name] += [(time(), next_results)]
sleep(self.delay)
def stop(self):
self.should_stop.value = True
print('%s %s stopped' % (type(self).__name__, self.name))
class HVACAgent(Agent):
def get_next_results(self):
# This is where you do your work, but for the sake of
# the example just return a constant dictionary.
return {'temperature': 5, 'pressure': 7, 'humidity': 9}
class DumbReadingAgent(Agent):
"""A dumb agent to demonstrate workers reading other worker values."""
def get_next_results(self):
# get hvac 1 results:
hvac1_results = self.shared_dictionary.get('hvac 1')
if hvac1_results is None:
return None
return hvac1_results[-1][1]['temperature']
# Script starts.
results = {}
# The "with" ensures we terminate the manager at the end.
with Manager() as manager:
# the manager is a subprocess in its own right. We can ask
# it to manage a dictionary (or other python types) for us
# to be shared among the other children.
shared_info = manager.dict()
hvac_agent1 = HVACAgent('hvac 1', shared_info)
hvac_agent2 = HVACAgent('hvac 2', shared_info, delay=0.1)
dumb_agent = DumbReadingAgent('dumb hvac1 reader', shared_info)
agents = (hvac_agent1, hvac_agent2, dumb_agent)
list(map(lambda a: a.start(), agents))
sleep(1)
list(map(lambda a: a.stop(), agents))
list(map(lambda a: a.join(), agents))
# Not quite sure what happens to the shared dictionary after
# the manager dies, so for safety make a local copy.
results = dict(shared_info)
pprint(results)

Conditional if in asynchronous python program with twisted

I'm creating a program that uses the Twisted module and callbacks.
However, I keep having problems because the asynchronous part goes wrecked.
I have learned (also from previous questions..) that the callbacks will be executed at a certain point, but this is unpredictable.
However, I have a certain program that goes like
j = calc(a)
i = calc2(b)
f = calc3(c)
if s:
combine(i, j, f)
Now the boolean s is set by a callback done by calc3. Obviously, this leads to an undefined error because the callback is not executed before the s is needed.
However, I'm unsure how you SHOULD do if statements with asynchronous programming using Twisted. I've been trying many different things, but can't find anything that works.
Is there some way to use conditionals that require callback values?
Also, I'm using VIFF for secure computations (which uses Twisted): VIFF

Maybe what you're looking for is twisted.internet.defer.gatherResults:
d = gatherResults([calc(a), calc2(b), calc3(c)])
def calculated((j, i, f)):
if s:
return combine(i, j, f)
d.addCallback(calculated)
However, this still has the problem that s is undefined. I can't quite tell how you expect s to be defined. If it is a local variable in calc3, then you need to return it so the caller can use it.
Perhaps calc3 looks something like this:
def calc3(argument):
s = bool(argument % 2)
return argument + 1
So, instead, consider making it look like this:
Calc3Result = namedtuple("Calc3Result", "condition value")
def calc3(argument):
s = bool(argument % 2)
return Calc3Result(s, argument + 1)
Now you can rewrite the calling code so it actually works:
It's sort of unclear what you're asking here. It sounds like you know what callbacks are, but if so then you should be able to arrive at this answer yourself:
d = gatherResults([calc(a), calc2(b), calc3(c)])
def calculated((j, i, calc3result)):
if calc3result.condition:
return combine(i, j, calc3result.value)
d.addCallback(calculated)
Or, based on your comment below, maybe calc3 looks more like this (this is the last guess I'm going to make, if it's wrong and you'd like more input, then please actually share the definition of calc3):
def _calc3Result(result, argument):
if result == "250":
# SMTP Success response, yay
return Calc3Result(True, argument)
# Anything else is bad
return Calc3Result(False, argument)
def calc3(argument):
d = emailObserver("The argument was %s" % (argument,))
d.addCallback(_calc3Result)
return d
Fortunately, this definition of calc3 will work just fine with the gatherResults / calculated code block immediately above.

You have to put if in the callback. You may use Deferred to structure your callback.

As stated in previous answer - the preocessing logic should be handled in callback chain, below is simple code demonstration how this could work. C{DelayedTask} is a dummy implementation of a task which happens in the future and fires supplied deferred.
So we first construct a special object - C{ConditionalTask} which takes care of storring the multiple results and servicing callbacks.
calc1, calc2 and calc3 returns the deferreds, which have their callbacks pointed to C{ConditionalTask}.x_callback.
Every C{ConditionalTask}.x_callback does a call to C{ConditionalTask}.process which checks if all of the results have been registered and fires on a full set.
Additionally - C{ConditionalTask}.c_callback sets a flag of wheather or not the data should be processed at all.
from twisted.internet import reactor, defer
class DelayedTask(object):
"""
Delayed async task dummy implementation
"""
def __init__(self,delay,deferred,retVal):
self.deferred = deferred
self.retVal = retVal
reactor.callLater(delay, self.on_completed)
def on_completed(self):
self.deferred.callback(self.retVal)
class ConditionalTask(object):
def __init__(self):
self.resultA=None
self.resultB=None
self.resultC=None
self.should_process=False
def a_callback(self,result):
self.resultA = result
self.process()
def b_callback(self,result):
self.resultB=result
self.process()
def c_callback(self,result):
self.resultC=result
"""
Here is an abstraction for your "s" boolean flag, obviously the logic
normally would go further than just setting the flag, you could
inspect the result variable and do other strange stuff
"""
self.should_process = True
self.process()
def process(self):
if None not in (self.resultA,self.resultB,self.resultC):
if self.should_process:
print 'We will now call the processor function and stop reactor'
reactor.stop()
def calc(a):
deferred = defer.Deferred()
DelayedTask(5,deferred,a)
return deferred
def calc2(a):
deferred = defer.Deferred()
DelayedTask(5,deferred,a*2)
return deferred
def calc3(a):
deferred = defer.Deferred()
DelayedTask(5,deferred,a*3)
return deferred
def main():
conditional_task = ConditionalTask()
dFA = calc(1)
dFB = calc2(2)
dFC = calc3(3)
dFA.addCallback(conditional_task.a_callback)
dFB.addCallback(conditional_task.b_callback)
dFC.addCallback(conditional_task.c_callback)
reactor.run()

Any tips for a more pythonic way to implement state machine behavior?

For brevity, I'm just showing what can/must occur in states. I haven't run into any oddities in the state machine framework itself.
Here is a specific question:
Do you find it confusing that we have to return StateChange(...) and StateMachineComplete(...) whereas some of the of the other actions like some_action_1(...) and some_action_2(...) need not be returned - they're just direct method invocations?
I think that StateChange(...) needs to return because otherwise code beyond the StateChange(...) call will be executed. This isn't how a state machine should work! For example see the implementation of event1 in the ExampleState below
import abc
class State(metaclass=abc.ABCMeta):
# =====================================================================
# == events the state optionally or must implement ====================
# =====================================================================
# optional: called when the state becomes active.
def on_state_entry(self): pass
# optional: called when we're about to transition away from this state.
def on_state_exit(self): pass
#abc.abstractmethod
def event1(self,x,y,z): pass
#abc.abstractmethod
def event2(self,a,b): pass
#abc.abstractmethod
def event3(self): pass
# =====================================================================
# == actions the state may invoke =====================================
# =====================================================================
def some_action_1(self,c,d,e):
# implementation omitted for brevity
pass
def some_action_2(self,f):
# implementation omitted for brevity
pass
class StateChange:
def __init__(self,new_state_type):
# implementation omitted for brevity
pass
class StateMachineComplete: pass
class ExampleState(State):
def on_state_entry(self):
some_action_1("foo","bar","baz")
def event1(self,x,y,z):
if x == "asdf":
return StateChange(ExampleState2)
else:
return StateChange(ExampleState3)
print("I think it would be confusing if we ever got here. Therefore the StateChange calls above are return")
def event2(self,a,b):
if a == "asdf":
return StateMachineComplete()
print("As with the event above, the return above makes it clear that we'll never get here.")
def event3(self):
# Notice that we're not leaving the state. Therefore this can just be a method call, nothing need be returned.
self.some_action_1("x","y","z")
# In fact we might need to do a few things here. Therefore a return call again doesn't make sense.
self.some_action_2("z")
# Notice we don't implement on_state_exit(). This state doesn't care about that.

When I need a state machine in Python, I store it as a dictionary of functions. The indices into the dictionary are the current states, and the functions do what they need to and return the next state (which may be the same state) and outputs. Turning the crank on the machine is simply:
state, outputs = machine_states[state](inputs)

By putting the outgoing state changes in code you're obfuscating the whole process. A state machine should be driven by a simple set of tables. One axis is the current state, and the other is the possible events. You have two or three tables:
The "next-state" table that determines the exit state
The "action" table that determines what action to take
The "read" table that determines whether you stay on the current input event or move on to the next.
The third table may or may not be needed depending on the complexity of the input "grammar".
There are more esoteric variations, but I've never found a need for more than this.

I also struggled to find a good state_machine solution in python. So I wrote state_machine
It works like the following
#acts_as_state_machine
class Person():
name = 'Billy'
sleeping = State(initial=True)
running = State()
cleaning = State()
run = Event(from_states=sleeping, to_state=running)
cleanup = Event(from_states=running, to_state=cleaning)
sleep = Event(from_states=(running, cleaning), to_state=sleeping)
#before('sleep')
def do_one_thing(self):
print "{} is sleepy".format(self.name)
#before('sleep')
def do_another_thing(self):
print "{} is REALLY sleepy".format(self.name)
#after('sleep')
def snore(self):
print "Zzzzzzzzzzzz"
#after('sleep')
def big_snore(self):
print "Zzzzzzzzzzzzzzzzzzzzzz"
person = Person()
print person.current_state == person.sleeping # True
print person.is_sleeping # True
print person.is_running # False
person.run()
print person.is_running # True
person.sleep()
# Billy is sleepy
# Billy is REALLY sleepy
# Zzzzzzzzzzzz
# Zzzzzzzzzzzzzzzzzzzzzz
print person.is_sleeping # True

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: Asynchronous Cassandra Inserts - python

Related

Threading, multiprocessing shared memory and asyncio

Python Multiprocessing.Pool in a class

Issue with sharing data between Python processes with multiprocessing

Conditional if in asynchronous python program with twisted

Any tips for a more pythonic way to implement state machine behavior?

Categories

Resources