Can someone provide an example and explain when and how to use Twisted's DeferredLock?
I have a DeferredQueue and I think I have a race condition I want to prevent, but I'm unsure how to combine the two.
Use a DeferredLock when you have a critical section that is asynchronous and needs to be protected from overlapping (one might say "concurrent") execution.
Here is an example of such an asynchronous critical section:
class NetworkCounter(object):
    def __init__(self):
        self._count = 0

    def next(self):
        self._count += 1
        recording = self._record(self._count)
        def recorded(ignored):
            return self._count
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
See how two concurrent uses of the next method will produce "corrupt" results:
from __future__ import print_function
counter = NetworkCounter()
d1 = counter.next()
d2 = counter.next()
d1.addCallback(print, "d1")
d2.addCallback(print, "d2")
Gives the result:
2 d1
2 d2
This is because the second call to NetworkCounter.next begins before the first call to that method has finished using the _count attribute to produce its result. The two operations share the single attribute and produce incorrect output as a consequence.
Using a DeferredLock instance will solve this problem by preventing the second operation from beginning until the first operation has completed. You can use it like this:
from twisted.internet.defer import DeferredLock

class NetworkCounter(object):
    def __init__(self):
        self._count = 0
        self._lock = DeferredLock()

    def next(self):
        return self._lock.run(self._next)

    def _next(self):
        self._count += 1
        recording = self._record(self._count)
        def recorded(ignored):
            return self._count
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
First, notice that the NetworkCounter instance creates its own DeferredLock instance. Each instance of DeferredLock is distinct and operates independently from any other instance. Any code that participates in the use of a critical section needs to use the same DeferredLock instance in order for that critical section to be protected. If two NetworkCounter instances somehow shared state then they would also need to share a DeferredLock instance - not create their own private instance.
Next, see how DeferredLock.run is used to call the new _next method (into which all of the application logic has been moved). Neither NetworkCounter nor the application code using NetworkCounter calls the method that contains the critical section; DeferredLock is given responsibility for doing this. This is how DeferredLock can prevent the critical section from being run by multiple operations at the "same" time. Internally, DeferredLock will keep track of whether an operation has started and not yet finished. It can only keep track of operation completion if the operation's completion is represented as a Deferred, though. If you are familiar with Deferreds, you probably already guessed that the (hypothetical) HTTP client API in this example, http.GET, returns a Deferred that fires when the HTTP request has completed. If you are not familiar with them yet, you should go read about them now.
Once the Deferred that represents the result of the operation fires - in other words, once the operation is done - DeferredLock will consider the critical section "out of use" and allow another operation to begin executing it. It does this by checking whether any code tried to enter the critical section while the critical section was in use and, if so, running the function for that operation.
Third, notice that in order to serialize access to the critical section, DeferredLock.run must return a Deferred. If the critical section is in use and DeferredLock.run is called, it cannot start another operation. Therefore, instead, it creates and returns a new Deferred. When the critical section goes out of use, the next operation can start and, when that operation completes, the Deferred returned by the DeferredLock.run call will get its result. This all ends up looking rather transparent to any users who are already expecting a Deferred - it just means the operation appears to take a little longer to complete (in truth the operation itself likely takes the same amount of time to run; it merely waits a while before it starts - the effect on the wall clock is the same).
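To make that queuing behavior concrete, here is a minimal sketch using only the real twisted.internet.defer API. The gate Deferred is just a stand-in for a slow asynchronous operation; no reactor is needed because DeferredLock is built purely on Deferreds:

from twisted.internet import defer

lock = defer.DeferredLock()
gate = defer.Deferred()  # fired manually below to mark the end of the first operation

def first():
    print("first entered")
    return gate  # the lock stays held until this Deferred fires

def second():
    print("second entered")

lock.run(first)
lock.run(second)     # nothing printed yet; second is queued behind first
gate.callback(None)  # releases the lock, and "second entered" is printed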
Of course, you can achieve a concurrent-use safe NetworkCounter more easily than all this by simply not sharing state in the first place:
class NetworkCounter(object):
    def __init__(self):
        self._count = 0

    def next(self):
        self._count += 1
        result = self._count
        recording = self._record(self._count)
        def recorded(ignored):
            return result
        recording.addCallback(recorded)
        return recording

    def _record(self, value):
        return http.GET(
            b"http://example.com/record-count?value=%d" % (value,))
This version moves the state used by NetworkCounter.next to produce a meaningful result for the caller out of the instance dictionary (i.e., it is no longer an attribute of the NetworkCounter instance) and into the call stack (i.e., it is now a closed-over variable associated with the actual frame that implements the method call). Since each call creates a new frame and a new closure, concurrent calls are now independent and no locking of any sort is required.
Finally, notice that even though this modified version of NetworkCounter.next still uses self._count which is shared amongst all calls to next on a single NetworkCounter instance this can't cause any problems for the implementation when it is used concurrently. In a cooperative multitasking system such as the one primarily used with Twisted, there are never context switches in the middle of functions or operations. There cannot be a context switch from one operation to another in between the self._count += 1 and result = self._count lines. They will always execute atomically and you don't need locks around them to avoid re-entrancy or concurrency induced corruption.
These last two points - avoiding concurrency bugs by avoiding shared state and the atomicity of code inside a function - combined mean that DeferredLock isn't often particularly useful. As a single data point, in the roughly 75 KLOC in my current work project (heavily Twisted based), there are no uses of DeferredLock.
Related
I have a variable. That variable will only ever be calculated by one continuously running thread; others will be able to access it but not adjust it in any way. The other threads will not cooperate with each other, so it doesn't matter whether one of the other threads thinks that a = 3 and another that a = 2. Here is a simple non-working example that demonstrates what I mean:
import random
from time import sleep
from threading import Thread

number = 0

def thread_target():
    global number
    while True:
        number = random.randint(1, 60)

def thread_target2():
    for i in range(10):
        print(number)
        sleep(1)

t1 = Thread(target=thread_target, daemon=True)  # daemon so the endless loop doesn't block exit
t2 = Thread(target=thread_target2)              # pass the function itself, don't call it
t1.start()
t2.start()
t2.join()
What is the "intended" tool/syntax for having something like this work?
This problem can be solved using a producer-consumer pattern with a single element buffer. The producer writes a new value as often as it needs to. The consumers read the value in the buffer without modifying it whenever they need to.
As stated in https://sworthodoxy.blogspot.com/2015/05/shared-resource-design-patterns.html, which describes a solution to the problem using an Ada protected object,
Unconditional Buffer
A single element buffer without any access barrier is used when the reading task only needs to sample data from the writing task. If the reading task executes faster than the writing task, the reading task will read the same value more than once. If the writing task executes faster than the reading task some values will be skipped. Unconditional buffers are often used when sampling sensor data. Sensor data may be delivered to a program at a rate many times faster than it can be analyzed. The unconditional buffer simplifies the communication between the task reading from the sensor and the task analyzing the sensor data.
protected type Read_Any_Buffer is
   procedure Put(Item : in Element_Type);
   function Get return Element_Type;
   function Initialized return Boolean;
private
   Value    : Element_Type;
   Is_Valid : Boolean := False;
end Read_Any_Buffer;
One issue with an unconditional buffer is determining if it contains valid data. It is unreasonable for the reading task to read uninitialized data. The function initialized can be polled to determine when the unconditional buffer has been initialized. After that happens the reading task merely calls the Get function whenever it wants access to the current value in the buffer.
protected body Read_Any_Buffer is
   procedure Put(Item : in Element_Type) is
   begin
      Value := Item;
      Is_Valid := True;
   end Put;

   function Get return Element_Type is
   begin
      if not Is_Valid then
         raise Uninitialized_Data;
      end if;
      return Value;
   end Get;

   function Initialized return Boolean is
   begin
      return Is_Valid;
   end Initialized;
end Read_Any_Buffer;
This example has the Get function raise the exception Uninitialized_Data if the function is called before data is initialized. The exception logic was placed in this function for safety only. It is much more efficient to poll the Initialized function than to iteratively handle exceptions.
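For readers more comfortable with Python, here is a rough translation of the protected object above - a minimal sketch using the standard threading module, with the names ReadAnyBuffer, put, get, and initialized mirroring the Ada version:

import threading

class ReadAnyBuffer:
    """Single-element buffer: the writer overwrites freely, readers sample the latest value."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._is_valid = False

    def put(self, item):
        with self._lock:
            self._value = item
            self._is_valid = True

    def get(self):
        with self._lock:
            if not self._is_valid:
                raise ValueError("uninitialized data")
            return self._value

    def initialized(self):
        with self._lock:
            return self._is_valid

As in the Ada version, a reader polls initialized() until it returns True, then simply calls get() whenever it wants the current value.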
As far as I know, we can't do this directly, but we can use the property() decorator so that the data is never accessed without going through the getter and setter methods.
You can read about how to use the property() function at this link
from time import sleep
import threading
import random

class Number:
    def __init__(self, number):
        """
        _number is private; no one should access this variable
        without the getter and setter methods
        """
        self._number = number

    @property
    def value(self):
        return self._number

    @value.setter
    def value(self, arg):
        if threading.current_thread().name == "Setter_thread":
            self._number = arg
        else:
            print("you have no permission")
number = Number(0)

def thread_target():
    while True:
        number.value = random.randint(1, 60)
        sleep(1)

def thread_target2():
    for i in range(5):
        print(number.value)
        sleep(1)
    number.value = 99  # will have no effect
    print(number.value)

# daemon=True so the endless setter loop doesn't keep the program alive forever
t1 = threading.Thread(target=thread_target, name="Setter_thread", daemon=True)
t2 = threading.Thread(target=thread_target2)
t1.start()
t2.start()
t2.join()
I'm writing a Python class, let's call it CSVProcessor. Its purpose is the following:
extract data from a CSV file
process that data in an arbitrary way
update a database with the freshly processed data
Now it sounds like this is way too much for one class but it's already relying on high-level components for steps 1 and 3, so I only need to focus on step 2.
I also established the following:
the data extracted in step 1 would be stored in a list
every single element of that list needs to be processed individually and independently of one another by step 2
the processed data needs to come out of step 2 as a list in order for step 3 to be continued
It's not a hard problem; Python is amazingly flexible, and in fact I have already found two solutions. But I'm wondering what the side effects of each are (if any) - basically, which should be preferred over the other and why.
Solution 1
During runtime, my class CSVProcessor accepts a function object and uses it in step 2 to process every single element output by step 1. It simply aggregates the results from that function in a list and carries on with step 3.
Sample code (outrageously simplified but gives an idea):
class CSVProcessor:
    ...
    def step_1(self):
        self.data = self.extract_data_from_CSV()

    def step_2(self, processing_function):
        # rebuild the list; merely reassigning the loop variable would not update self.data
        self.data = [processing_function(element) for element in self.data]

    def step_3(self):
        self.update_database(self.data)
Usage:
csv_proc = CSVProcessor()
csv_proc.step_1()
csv_proc.step_2(my_custom_function)  # my_custom_function would be defined elsewhere
csv_proc.step_3()
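For illustration, my_custom_function could be as simple as this (a hypothetical example; the real function is whatever step 2 needs to do to each element):

def my_custom_function(element):
    # hypothetical: strip whitespace from every field of a CSV row
    return [field.strip() for field in element]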
Solution 2
My class CSVProcessor defines an "abstract method" whose purpose is to process single elements in a concrete implementation of the class. Before runtime, CSVProcessor is inherited from by a new class, and its abstract method is overridden to process the elements.
class CSVProcessor:
    ...
    def step_1(self):
        self.data = self.extract_data_from_CSV()

    def processing_function(self, element):  # abstract method to be overridden
        raise NotImplementedError

    def step_2(self):
        self.data = [self.processing_function(element) for element in self.data]

    def step_3(self):
        self.update_database(self.data)
Usage:
class ConcreteCSVProcessor(CSVProcessor):  # inherit from CSVProcessor
    def processing_function(self, element):  # here it gets overridden
        # Do actual stuff
        # Blah blah blah
csv_proc = ConcreteCSVProcessor()
csv_proc.step_1()
csv_proc.step_2() # No need to pass anything!
csv_proc.step_3()
In hindsight these two solutions share much the same workflow; my question is more like "where should the data processing function reside?".
In C++ I'd obviously have gone with the second solution but both ways in Python are just as easy to implement and I don't really see a noticeable difference in them apart from what I mentioned above.
And today there's also such a thing as considering one's ways of doing things more or less Pythonic... :p
I'm trying to study the python library Telepot by looking at the counter.py example available here: https://github.com/nickoala/telepot/blob/master/examples/chat/counter.py.
I'm finding it a little bit difficult to understand how the DelegatorBot class actually works.
This is what I think I've understood so far:
1.
I see that initially this class (derived from "ChatHandler" class) is being defined:
class MessageCounter(telepot.helper.ChatHandler):
    def __init__(self, *args, **kwargs):
        super(MessageCounter, self).__init__(*args, **kwargs)
        self._count = 0

    def on_chat_message(self, msg):
        self._count += 1
        self.sender.sendMessage(self._count)
2.
Then a bot is created by instancing the class DelegatorBot:
bot = telepot.DelegatorBot(TOKEN, [
    pave_event_space()(
        per_chat_id(), create_open, MessageCounter, timeout=10
    ),
])
3.
I understand that a new instance of DelegatorBot is created and put in the variable bot. The first parameter is the token needed by telegram to authenticate this bot, the second parameter is a list that contains something I don't understand.
I mean this part:
pave_event_space()(
    per_chat_id(), create_open, MessageCounter, timeout=10
)
And then my question is..
Is pave_event_space() a method call that returns a reference to another method? And is this returned method then invoked with the parameters (per_chat_id(), create_open, MessageCounter, timeout=10)?
Short answer
Yes, pave_event_space() returns a function. Let's call that fn. fn is then invoked with fn(per_chat_id(), create_open, ...), which returns a 2-tuple (seeder function, delegate-producing function).
If you want to study the code further, this short answer probably is not very helpful ...
Longer answer
To understand what pave_event_space() does and what that series of arguments means, we have to go back to basics and understand what DelegatorBot accepts as arguments.
DelegatorBot's constructor is explained here. Simply put, it accepts a list of 2-tuples (seeder function, delegate-producing function). To reduce verbosity, I am going to call the first element seeder and the second element delegate-producer.
A seeder has this signature: seeder(msg) -> number. For every message received, seeder(msg) gets called to produce a number. If that number is new, the companion delegate-producer (the one that shares the same tuple with the seeder) will get called to produce a thread, which is used to handle the new message. If that number is occupied by a running thread, nothing is done. In essence, the seeder "categorizes" the message. It spawns a new thread if it sees a message belonging to a new "category".
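As a minimal sketch (my own illustration of the idea, not telepot's actual source), a per_chat_id-style seeder boils down to something like this:

def per_chat_id():
    def seeder(msg):
        # messages from the same chat map to the same number,
        # so one handler thread serves one chat
        return msg["chat"]["id"]
    return seeder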
A delegate-producer has this signature producer(cls, *args, **kwargs) -> Thread. It calls cls(*args, **kwargs) to instantiate a handler object (MessageCounter in your case) and wrap it in a thread, so the handler's methods are executed independently.
(Note: In reality, a seeder does not necessarily return a number and a delegate-producer does not necessarily return a Thread. I have simplified above for clarity. See the reference for a full explanation.)
In earlier days of telepot, a DelegatorBot was usually made by supplying a seeder and a delegate-producer transparently:
bot = DelegatorBot(TOKEN, [
    (per_chat_id(), create_open(MessageCounter, ...))])
Later, I added to handlers (e.g. ChatHandler) a capability to generate their own events (say, a timeout event). Each class of handlers gets its own event space, so different classes' events won't mix. Within each event space, the event objects themselves also have a source id to identify which handler emitted them. This architecture puts some extra requirements on seeders and delegate-producers.
Seeders have to be able to "categorize" events (in addition to external messages) and return the same number that leads to the event emitter (because we don't want to spawn a thread for this event; it's supposed to be handled by the event emitter itself). Delegate-producers also have to pass the appropriate event space to the Handler class (because each Handler class gets a unique event space, generated externally).
For everything to work properly, the same event space has to be supplied to the seeder and its companion delegate-producer. And every pair of (seeder, delegate-producer) has to get a globally unique event space. pave_event_space() ensures these two conditions; basically, it patches some extra operations and parameters onto per_chat_id() and create_open() and makes sure they are consistent.
Deeper still
Exactly how is the "patching" done? And why do I make you do pave_event_space()(...) instead of the more straightforward pave_event_space(...)?
First, recall that our ultimate goal is to have a 2-tuple (per_chat_id(), create_open(MessageCounter, ...)). To "patch" it usually means (1) appending some extra operations to per_chat_id(), and (2) inserting some extra parameters to the call create_open(... more arguments here ...). That means I cannot let the user call create_open(...) directly because, once it is called, I cannot insert extra parameters. I need a more abstract construct in which the user specifies create_open but the call create_open(...) is actually made by me.
Imagine a function named pair, whose signature is pair(per_chat_id(), create_open, ...) -> (per_chat_id(), create_open(...)). In other words, it passes the first argument through as the first tuple element, and creates the second tuple element by making an actual call to create_open(...) with the remaining arguments.
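A minimal sketch of such a pair function (again my own illustration, not telepot source) might be:

def pair(seeder, delegate_producer, *args, **kwargs):
    # pass the seeder through untouched; make the actual call to the
    # delegate-producer here, so extra parameters can be inserted first
    return seeder, delegate_producer(*args, **kwargs)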
Now, it reaches a point where I am unable to explain source code in words (I have been thinking for 30 minutes). The pseudo-code of pave_event_space looks like this:
def pave_event_space(fn=pair):
    def p(s, d, *args, **kwargs):
        return fn(append_event_space_seeder(s),
                  d, *args, event_space=event_space, **kwargs)
    return p
It takes the function pair, and returns a pair-like function (signature identical to pair), but with a more complex seeder and more parameters tagged on. That's what I meant by "patching".
pave_event_space is the most often-seen "patcher". Other patchers include include_callback_query_chat_id and intercept_callback_query_origin. They all do basically the same kind of thing: take a pair-like function and return another pair-like function, with a more complex seeder and more parameters tagged on. Because the input and output are alike, they can be chained to apply multiple patches. If you look into the callback examples, you will see something like this:
bot = DelegatorBot(TOKEN, [
    include_callback_query_chat_id(
        pave_event_space())(
            per_chat_id(), create_open, Lover, timeout=10),
])
It patches event space stuff, then patches callback query stuff, to enable the seeder (per_chat_id()) and handler (Lover) to work cohesively.
That's all I can say for now. I hope this throws some light on the code. Good luck.
Here is a simple scenario:
import multiprocessing as mp

class Test:
    def __init__(self):
        self.foo = []

    def append(self, x):
        self.foo.append(x)

    def get(self):
        return self.foo

def process_append_queue(append_queue, bar):
    while True:
        x = append_queue.get()
        if x is None:
            break
        bar.append(x)
    print("worker done")

def main():
    bar = Test()
    append_queue = mp.Queue(10)
    append_queue_process = mp.Process(target=process_append_queue, args=(append_queue, bar))
    append_queue_process.start()
    for i in range(100):
        append_queue.put(i)
    append_queue.put(None)
    append_queue_process.join()
    print(str(bar.get()))

if __name__ == "__main__":
    main()
When you call bar.get() at the end of the main() function, why does it still return an empty list? How can I make it so that the child process also works with the same instance of Test, not a new one?
All answers appreciated!
In general, processes have distinct address spaces, so that mutations of an object in one process have no effect on any object in any other process. Interprocess communication is needed to tell a process about changes made in another process.
That can be done explicitly (using things like multiprocessing.Queue), or implicitly if you use a facility implemented by multiprocessing for this purpose. For example, a great deal of work is done under the covers to make changes to a multiprocessing.Queue visible across processes.
The easiest way in your specific example is to replace your __init__ function like so:
def __init__(self):
    import multiprocessing as mp
    self.foo = mp.Manager().list()
It so happens that an mp.Manager instance supports a list() method that creates a process-aware list object (really a proxy for a list object, which forwards list operations to an under-the-covers server process that maintains a single copy of "the real" list - the list object isn't really shared across processes, because that's impossible - but the proxies make it appear to be shared).
So if you make that change, your code will display the results you expect - and there is no simpler way.
Note that multiprocessing works better the less IPC (interprocess communication) you need, and that's true pretty much regardless of application or programming language.
Objects are copied between processes by pickling them and passing the string over a pipe. There is no way to achieve true "shared memory" for pure Python objects between processes. To achieve precisely this type of synchronization, take a look at the multiprocessing.Manager documentation (https://docs.python.org/2/library/multiprocessing.html#managers) which provides you with examples about synchronized versions of common Python container types. These are "proxied" containers where operations on the proxy send all arguments across the process boundary, pickled, and are then executed in the parent process.
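A minimal self-contained sketch of the Manager approach looks like this:

import multiprocessing as mp

def worker(shared):
    shared.append(42)  # this append is forwarded to the manager's server process

if __name__ == "__main__":
    manager = mp.Manager()
    shared = manager.list()
    p = mp.Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(list(shared))  # prints [42]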
I am writing a Python interface which basically constructs messages from a database row by row and sends the stream to a TCP socket; another thread checks the TCP responses, decides if there's an error response, skips certain streams, and retries from earlier ones.
Pseudo-code below, PK means PrimaryKey.
It's basically like this
def generate_msg(pk_start, pk_stop):
    for x in db.query(pk > pk_start and pk < pk_stop):
        yield pack_to_stream(x)
then the tcp socket send thread is like:
for msg in generate_msg(first_id, last_id):
    socket.send(msg)
The problem is that when the TCP socket read thread finds an error in a response, the message's pk is returned, so I need to restart the iterator from that pk.
So here's my question:
what's the design pattern for an iterator which can move both forward and backward, especially when working with database row cursors?
can I get the total count of an iterator in the first place without reading the whole list?
What's the general advice for my scenario?
Thanks
Iterators are designed to save memory by dealing with one item at a time, and can potentially produce an unlimited number of items. As a result of their design however, you usually cannot know their length without consuming the whole iterator, and you are normally not expected to be able to steer them.
That said, there is nothing stopping you from making a custom class that can be used both as an iterator and can provide additional functionality. Database cursors are the canonical example of such a class; the cursor can be iterated over to yield rows, but you can also ask it for a rowcount (so the length of the sequence), and get additional information about columns, get multiple rows, or point to a new result set by calling the .execute() method.
If you want to build a custom class that acts as an iterator, you need to give it an __iter__() method. You either make this method a generator (by using the yield statement), or just return self and give your class a .next() method (spelled __next__() in Python 3); the latter is expected to return one item (do not use yield), or raise StopIteration when no more items can be returned.
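As an aside, the generator form of __iter__() mentioned above looks like this (a minimal sketch):

class Rows(object):
    def __init__(self, items):
        self._items = items

    def __iter__(self):
        # generator-style __iter__: each for-loop gets a fresh, independent iterator
        for item in self._items:
            yield item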
You can then add other methods that return length information, or re-set the query to start from a given primary key.
Untested, python-ish code:
class MessagesIterator(object):
    def __init__(self, pk_start, pk_stop):
        self.pk_start, self.pk_stop = pk_start, pk_stop
        self.cursor = db.query("pk>? and pk<?", (pk_start, pk_stop))

    def __iter__(self):
        return self

    def next(self):
        return next(self.cursor)  # raises StopIteration when done

    __next__ = next  # Python 3 spelling of the same method

    def length(self):
        return self.cursor.rowcount

    def move_to(self, pk_start):
        # Validate pk_start perhaps
        self.pk_start = pk_start
        self.cursor = db.query("pk>? and pk<?", (self.pk_start, self.pk_stop))
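Hypothetical usage in your send thread might then look like this (failed_pk stands in for the pk reported by your read thread; pack_to_stream and socket are from your own pseudo-code):

it = MessagesIterator(first_id, last_id)
print(it.length())  # total row count, without consuming the iterator
for msg in it:
    socket.send(pack_to_stream(msg))

# when the read thread reports an error for some pk, rewind and resume:
it.move_to(failed_pk)
for msg in it:
    socket.send(pack_to_stream(msg))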