How to notify a parent thread of job completion in Python

I would like to use the following code to find a specific number of permutations in a list.
def permutations(ls, prefix, result):
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result)
    else:
        result.append(prefix)
    return result
My issue is that I cannot simply include another parameter to count the number of permutations found. I can't do that because each recursive call to permutations() "splits" into a new "version" (it's like a fork in the road with the old counter and each branch counts its own number of permutations found, not communicating with the others). In other words, this won't work:
def permutations(ls, prefix, result, count, limit):
    if count > limit:
        return
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result, count, limit)
    else:
        count += 1  # only updates this call's local copy of count
        result.append(prefix)
    return result
So what I would like to do instead of including a count parameter in the function signature, is to notify some other part of my program every time a new permutation is found, and keep track of the count that way. This may be a good use of threading, but I would like to do it without parallelization if possible (or at least the simplest parallelized solution possible).
I realize that I would have to spawn a new thread at each call to permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result) in the for loop.
I'm hoping that someone would be so kind as to point me in the right direction or let me know if there is a better way to do this in Python.

If you are not required to use threading, then I recommend not using it, and not thinking about the problem in terms of threading either.
The reason is that the more simply and directly you are able to tackle a problem, the easier it is to think about.
As a second tip, any time you find yourself iterating through permutations, you should probably look for a better approach. The reason is that the number of permutations of length n grows as n!, and depending on what you are doing and how patient you are, computers top out somewhere between n=10 and n=15. So finding ways to count without actually iterating becomes essential. How to do that, of course, depends on your problem.
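For instance, if all you need is the number of permutations, a short sketch using the standard library's math.factorial counts them without generating any:

import math

def count_permutations(n, k=None):
    """Number of permutations of n items, or of length-k arrangements of n items."""
    if k is None:
        return math.factorial(n)
    return math.factorial(n) // math.factorial(n - k)

print(count_permutations(10))     # 3628800
print(count_permutations(10, 3))  # 720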
But back to the problem as stated. I would personally solve this type of problem in Python using generators. That is, you have code that can produce the next element of the list in a generator, and then elsewhere you can have code that processes it. This allows you to start processing your list right away, and not keep it all in memory.
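As a rough sketch of what that could look like for the permutation problem above (this generator and the islice-based stopping are illustrative, not the only way to structure it):

from itertools import islice

def gen_permutations(ls, prefix=()):
    # Yield permutations one at a time instead of accumulating them all in a list.
    if not ls:
        yield prefix
        return
    for i in range(len(ls)):
        yield from gen_permutations(ls[:i] + ls[i + 1:], prefix + (ls[i],))

# The caller keeps the count: stop after the first five permutations.
first_five = list(islice(gen_permutations((1, 2, 3, 4)), 5))
print(len(first_five))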
In a language without generators, I would tackle this with closures. That is, you pass in a function (or object) that you call for each value, which does whatever it wants to do. That again allows you to separate the iteration logic from the logic of what you want to do with each iteration.
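In Python terms, that callback style might look something like this sketch (permutations_cb, on_found and make_counter are just illustrative names):

def permutations_cb(ls, prefix, on_found):
    # Invoke on_found for each completed permutation; the callback decides what to do with it.
    if ls:
        for i in range(len(ls)):
            permutations_cb([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], on_found)
    else:
        on_found(prefix)

# A closure that keeps its own count of how many permutations were reported.
def make_counter():
    seen = [0]
    def on_found(perm):
        seen[0] += 1
    return on_found, seen

on_found, seen = make_counter()
permutations_cb([1, 2, 3], [], on_found)
print(seen[0])  # prints 6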
If you're working with some other form of cooperative multi-tasking, use that instead. So, for example, in JavaScript you would have to figure out how to coordinate using Promises. (Luckily the async/await syntax lets you do that and make it look almost like a generator approach. Note that you may wind up with large parts of the data set in memory at once. How to avoid that is a topic in and of itself.) For another example, in Go you should use channels and goroutines.
I would only go to global variables as a last resort. And if you do, remember that you need enough memory to keep the entire data set that you iterated over in memory at once. This may be a lot of memory!
I prefer all of these over the usual multi-threading approach.

Pythonic pattern for building up parallel lists

I am new-ish to Python and I am finding that I am writing the same pattern of code over and over again:
def foo(list):
    results = []
    for n in list:
        # do some or a lot of processing on n and possibly other variables
        nprime = operation(n)
        results.append(nprime)
    return results
I am thinking in particular about the creation of the empty list followed by the append call. Is there a more Pythonic way to express this pattern? append might not have the best performance characteristics, but I am not sure how else I would approach it in Python.
I often know exactly the length of my output, so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up. I am writing a lot of text parsing code that isn't super performance sensitive on any particular loop or piece because all of the performance is really contained in gensim or NLTK code and is in much more capable hands than mine.
Is there a better/more pythonic pattern for doing this type of operation?
First, a list comprehension may be all you need (if all the processing mentioned in your comment occurs in operation):
def foo(list):
    return [operation(n) for n in list]
If a list comprehension will not work in your situation, consider whether foo really needs to build the list and could be a generator instead.
def foo(list):
    for n in list:
        # Processing...
        yield operation(n)
In this case, you can iterate over the sequence, and each value is calculated on demand:
for x in foo(myList):
    ...
or you can let the caller decide if a full list is needed:
results = list(foo(myList))
If neither of the above is suitable, then building up the return list in the body of the loop as you are now is perfectly reasonable.
[..] so calling append each time seems like it might be causing memory fragmentation, or performance problems, but I am also wondering if that is just my old C ways tripping me up.
If you are worried about this, don't be. Python over-allocates whenever a list needs to be resized (lists are dynamically resized as they grow) in order to make appends amortized O(1). Whether you call list.append manually or build the list with a list comprehension (which internally also uses .append), the effect, memory-wise, is similar.
The list comprehension just performs a bit better speed-wise; it is optimized for creating lists, with specialized byte-code instructions that aid it (mainly LIST_APPEND, which calls the list's append directly in C).
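You can observe the over-allocation directly with sys.getsizeof; a quick sketch (the exact sizes vary by Python version and platform):

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(20):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # The size jumps in steps: room for several future appends is reserved at once.
        print(len(lst), size)
        last_size = size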
Of course, if memory usage is a concern, you could always opt for the generator approach, as highlighted in chepner's answer, to lazily produce your results.
In the end, for loops are still great. They might seem clunky in comparison to comprehensions and maps, but they still offer a recognizable and readable way to achieve a goal. for loops deserve our love too.

Python queue like data structure

I'm working on a list that receives new elements once in a while. When these new elements have been added, I want to perform a computation over these elements (to be precise, estimate a KDE). I quickly realized that if this list were to grow unbounded, the computation of the KDE function would take extremely long, so I thought a Queue would be a good data structure to use. The standard Python Queue (https://docs.python.org/2/library/queue.html), however, does not allow for access to individual Queue elements without 'popping' them out of the queue. Is there any alternative?
In other words: is there some Python library that allows me to get a queue element without popping it? (or that allows array-like indexing of the queue elements?)
It sounds like you would get good use from using deque:
https://docs.python.org/2/library/collections.html#collections.deque
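A deque supports indexing without removing elements, and with maxlen it silently discards the oldest items once it is full; a minimal sketch:

from collections import deque

window = deque(maxlen=1000)   # old elements are dropped automatically

def add_samples(samples):
    window.extend(samples)

add_samples(range(5))
print(window[0], window[-1])  # index without popping
print(list(window))           # or hand the whole window to your KDE estimate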
I don't understand why you would use a queue if you don't use the popping mechanism. If you are worried about your array becoming overcrowded, you could use one array and one queue: the first acts as a waiting queue and the second as the processing array.
You could also do some optimizations on your loop to speed it up.
For example, you could change
import xxx

for a in b_array:
    xxx.do_something(a)
to this:
import xxx

ds = xxx.do_something  # binding the function to a local name avoids repeated attribute lookups in the loop
for a in b_array:
    ds(a)
I think your problem is not really about the queue size, though. If it is, you should check your earlier code.
As @RemcoGerlich suggested, the best way forward (I believe) is to maintain an index pointer that 'memorizes' where the next suitable write position is, modulo the size of the list. This allows for a very fast implementation using numpy and also lets me achieve the goals I specified.
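A minimal sketch of that idea, assuming a fixed window size (the class and method names are just illustrative):

import numpy as np

class RingBuffer(object):
    """Fixed-size buffer; the write index wraps around modulo the size."""
    def __init__(self, size):
        self.data = np.zeros(size)
        self.size = size
        self.index = 0      # next write position
        self.count = 0      # how many slots are filled so far

    def add(self, value):
        self.data[self.index] = value
        self.index = (self.index + 1) % self.size
        self.count = min(self.count + 1, self.size)

    def values(self):
        return self.data[:self.count]

buf = RingBuffer(1000)
for x in range(10):
    buf.add(x)
print(buf.values())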

Python sorted consumes tons of memory? (from a power set generator)

I'm basically looking for some feedback from others that might have an opinion on this. The following is not exactly what I'm working on but the sample code does reproduce the issue.
I have a power set generator that returns all the permutations of a basic list I'm passing in. I need to sort the generated sets (in my real case the returned sets are tuples with a value that I want to sort by; the example below demonstrates the issue fine without it).
The issue is that when I use sorted() on the power set generator, it blows memory usage up. I realize that 2^50 is a very large number, but without sorted, memory usage is quite flat, so I'm wondering if there's a better way to sort a super large number of sets without running out of memory within a minute or two. This is running on Ubuntu with Python 2.6.5 (also required in this case).
def gen_powerset(seq):
    if len(seq) <= 1:
        yield seq
        yield []
    else:
        for i in gen_powerset(seq[1:]):
            yield [seq[0]] + i
            yield i

def main():
    initialSet = range(50)
    powerset = sorted(gen_powerset(initialSet))
    for i in powerset:
        print i

if __name__ == "__main__":
    main()
Disclaimer: If you try running this sample, please watch your memory utilization. Ctrl-C the sample if it nears 90% as your OS will start swapping memory to disk. If the sample is still running, your disk load will spike and really slow things down, making it hard to kill the sample in the first place.
Without sorted, you never need to store more than one or two values at a time; they get computed as they're needed because you're using generators (yield). Unfortunately, there is no good way to sort a list without knowing the whole thing (you can't yield a value from the sort until you've looked at all the items to make sure that the one you have is the smallest).
Of course, if you have 2 sorted sublists, you can merge them lazily, so you could build a sort which didn't store everything in memory at once based off a merge sort, but it would be horribly inefficient in the general case.
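For reference, that lazy merge of already-sorted iterables is what heapq.merge does; a small sketch of the idea (it only helps if you can produce sorted sub-streams in the first place):

import heapq

evens = (x for x in xrange(0, 10, 2))  # already sorted
odds = (x for x in xrange(1, 10, 2))   # already sorted

# merge yields values lazily, never materializing both streams at once
for value in heapq.merge(evens, odds):
    print value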
The reason memory usage is higher with sorted is that it has to load all the items into memory at once. Since you wrote a generator, it only yields one element at a time, and the way you're using it only uses one value at a time, so Python doesn't need to keep all the items around at once. But you can't sort them without having all of them available.
You can't get around this as long as you're doing sorting, because the sort has to have all elements available.
The only way to get around the problem would be to rewrite your powerset generator to generate the items in the order you want. This may or may not be possible depending on exactly what order you want.
You're using a generator, which only creates one value at a time before it is consumed; this is very memory efficient. The sorted function will need to convert that to a list so it all resides in memory at once. There's no way around it.

Writing reusable code

I find myself constantly having to change and adapt old code back and forth repeatedly for different purposes, but occasionally to implement the same purpose it had two versions ago.
One example of this is a function which deals with prime numbers. Sometimes what I need from it is a list of n primes. Sometimes what I need is the nth prime. Maybe I'll come across a third need from the function down the road.
Any way I do it though I have to do the same processes but just return different values. I thought there must be a better way to do this than just constantly changing the same code. The possible alternatives I have come up with are:
Return a tuple or a list, but this seems kind of messy since there will be all kinds of data types within including lists of thousands of items.
Use input statements to direct the code, though I would rather just have it do everything for me when I click run.
Figure out how to utilize class features to return class properties and access them where I need them. This seems to be the cleanest solution to me, but I am not sure since I am still new to this.
Just make five versions of every reusable function.
I don't want to be a bad programmer, so which choice is the correct choice? Or maybe there is something I could do which I have not thought of.
Modular, reusable code
Your question is indeed important. It's important in a programmer's everyday life. It is the question:
Is my code reusable?
If it's not, you will run into code redundancy, having the same lines of code in more than one place. This is a perfect breeding ground for bugs. Imagine you want to change the behavior somehow, e.g., because you discovered a potential problem. Then you change it in one place, but you forget the second location. This gets especially likely as your code grows to 1,000, 10,000 or 100,000 lines.
This is summarized in the SRP, the Single Responsibility Principle. It states that every class (and this applies to functions as well) should have only one responsibility, that it "should do just one thing". If a function does more than one thing, you should break it apart into smaller chunks, smaller tasks.
Every time you come across (or write) a function with more than 10 or 20 lines of (real) code, you should be skeptical. Such functions rarely stick to this principle.
For your example, you could identify as individual tasks:
generate prime numbers, one by one (generate implies using yield for me)
collect n prime numbers. Uses 1. and puts them into a list
get nth prime number. Uses 1., but does not save every number, just waits for the nth. Does not consume as much memory as 2. does.
Find pairs of primes: Uses 1., remembers the previous number and, if the difference to the current number is two, yields this pair
collect all pairs of primes: Uses 4. and puts them into a list
...
...
The list is extensible, and you can reuse it at any level. Every function will have no more than about 10 lines of code, and you will not be reinventing the wheel every time.
Put them all into a module, and use it from every script for a Project Euler problem related to primes.
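A rough sketch of what the first three tasks could look like, with illustrative names (the trial-division test is deliberately simple, not fast):

def gen_primes():
    """Task 1: yield prime numbers one by one."""
    n = 2
    while True:
        if all(n % p for p in range(2, int(n ** 0.5) + 1)):
            yield n
        n += 1

def first_n_primes(n):
    """Task 2: collect the first n primes into a list."""
    gen = gen_primes()
    return [next(gen) for _ in range(n)]

def nth_prime(n):
    """Task 3: return the nth prime without storing the earlier ones."""
    gen = gen_primes()
    for _ in range(n - 1):
        next(gen)
    return next(gen)

print(first_n_primes(5))  # [2, 3, 5, 7, 11]
print(nth_prime(5))       # 11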
In general, I started a small library for my Euler Problem scripts. You really can get used to writing reusable code in "Project Euler".
Keyword arguments
Another option you didn't mention (as far as I can tell) is the use of optional keyword arguments. If you regard small, atomic functions as too complicated (though I really insist you should get used to them), you could add a keyword argument to control the return value. E.g., some scipy functions have a parameter full_output that takes a bool. If it's False (the default), only the most important information is returned (e.g., an optimized value); if it's True, some supplementary information is returned as well, e.g., how well the optimization performed and how many iterations it took to converge.
You could define a parameter output_mode, with possible values "list", "last" or whatever.
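A small sketch of that keyword-argument style (output_mode and the simple prime routine are just illustrative):

def primes(n, output_mode="list"):
    """Return the first n primes as a list, or only the nth prime."""
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    if output_mode == "last":
        return found[-1]
    return found

print(primes(5))                      # [2, 3, 5, 7, 11]
print(primes(5, output_mode="last"))  # 11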
Recommendation
Stick to small, reusable chunks of code. Getting used to this is one of the most valuable things you can pick up at "Project Euler".
Remark
If you try to implement the pattern I propose for reusable functions, you might run into a problem immediately at point 1: How to create a generator-style function for this? E.g., if you use the sieve method. But it's not too bad.
My suggestion: create a module that contains:
a private core function (for example, one that returns a list of the first n primes, or something even more general)
several wrapper/utility functions that use the core one and prepare the output in different ways (for example, the nth prime number)
Try to reduce your functions as much as possible, and reuse them.
For example you might have a function next_prime which is called repeatedly by n_primes and n_th_prime.
This also makes your code more maintainable, as if you come up with a more efficient way to count primes, all you do is change the code in next_prime.
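A bare sketch of that delegation structure, with illustrative names and a deliberately simple primality check:

def next_prime(n):
    """Return the smallest prime strictly greater than n (simple trial division)."""
    candidate = n + 1
    while any(candidate % d == 0 for d in range(2, int(candidate ** 0.5) + 1)):
        candidate += 1
    return candidate

def n_primes(n):
    primes, p = [], 1
    for _ in range(n):
        p = next_prime(p)
        primes.append(p)
    return primes

def n_th_prime(n):
    return n_primes(n)[-1]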
Furthermore, you should make your output as neutral as possible. If your function returns several values, it should return a list or a generator, not a comma-separated string.

Delayed execution in python for big data

I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/NumPy/Matlab) but delay the computation, perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.
I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore prevent any slowdown caused by looping over the data twice, without giving up the benefit of using optimized processing functions.
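A sketch of that chunked idea, assuming rows can be pulled in fixed-size batches (fetch_chunks is a hypothetical stand-in for whatever the datastore client provides):

import numpy as np

CHUNK_SIZE = 100000

def process(fetch_chunks):
    # fetch_chunks is assumed to yield numpy record arrays with an 'income' field,
    # each holding up to CHUNK_SIZE rows.
    for chunk in fetch_chunks(CHUNK_SIZE):
        logincome = np.log(chunk['income'])                        # vectorized, compiled loop
        rich = np.where(chunk['income'] > 100000, "Rich", "Poor")  # vectorized too
        yield chunk, logincome, rich                               # hand back one processed chunk at a time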
I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.
For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:
for row in rows(data):
    # do stuff with row
Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But you would be using the send method of the generator. As an example for such a consumer, here is a sketch of riches:
def riches():
    rich_data = []
    while True:
        row = (yield)
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data
The first yield (expression) is just there to feed the individual rows into riches. It does its thing, here building up a result array. After the while loop, the second yield (statement) is used to actually provide the result data to the caller.
Going back to the caller loop, it could look something like this:
richConsumer = riches()
richConsumer.next()                   # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers' send(row) calls would go here
data.rich = richConsumer.send(None)   # sending None ends the inner loop; the consumer yields its result
I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH
