I'm working on a list that receives new elements once in a while. When new elements have been added, I want to perform a computation over them (to be precise, estimate a KDE). I quickly realized that if this list were to grow unbounded, computing the KDE would take an extremely long time, so I thought a Queue would be a good data structure to use. The standard Python Queue (https://docs.python.org/2/library/queue.html), however, does not allow access to individual Queue elements without 'popping' them out of the queue. Is there any alternative?
In other words: is there some Python library that allows me to get a queue element without popping it? (or that allows array-like indexing of the queue elements?)
It sounds like you would get good use out of a deque:
https://docs.python.org/2/library/collections.html#collections.deque
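In particular, a deque created with a maxlen bound keeps only the most recent elements and still supports indexing. A minimal sketch (the window size of 1000 is just an example, not something from your question):

from collections import deque

window = deque(maxlen=1000)      # once full, the oldest elements are discarded automatically
window.extend([1.5, 2.3, 0.7])   # append newly arrived elements
print(window[0])                 # read an element without popping it
samples = list(window)           # e.g., hand the current window to your KDE estimate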
I don't understand why you would use a queue if you don't use the popping mechanism. If you are worried about overcrowding your array, you may use one array and one queue: the first is a waiting queue and the second is a processing array.
You may also do some optimizations on your loop to speed it up.
For example, you may change
import xxx
for a in b_array:
    xxx.do_something(a)
to this:
import xxx
ds = xxx.do_something  # binding the method to a local name avoids repeated attribute lookups inside the loop
for a in b_array:
    ds(a)
I think your problem is not about queue size. If it is, you should check your earlier code.
As @RemcoGerlich suggested, the best way forward (I believe) is to maintain an index pointer that 'memorizes' the next suitable write position, modulo the size of the list. This allows a very fast implementation using numpy and will also let me achieve the goals I specified.
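For instance, a rough sketch of that circular-buffer idea with numpy (the size and dtype here are arbitrary choices, not part of the original question):

import numpy as np

size = 1000
buf = np.empty(size)     # fixed-size storage for the most recent samples
write_pos = 0            # index pointer: next write position
count = 0                # how many slots have been filled so far

def add(value):
    global write_pos, count
    buf[write_pos] = value
    write_pos = (write_pos + 1) % size   # wrap around, overwriting the oldest entry
    count = min(count + 1, size)

add(3.14)
current = buf[:count]    # the valid samples, e.g., the input to the KDE estimate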
I would like to use the following code to find a specific number of permutations in a list.
def permutations(ls, prefix, result):
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result)
    else:
        result.append(prefix)
    return result
My issue is that I cannot simply include another parameter to count the number of permutations found. I can't do that because each recursive call to permutations() "splits" into a new "version" (it's like a fork in the road with the old counter and each branch counts its own number of permutations found, not communicating with the others). In other words, this won't work:
def permutations(ls, prefix, result, count, limit):
    if count > limit:
        return
    if ls:
        for i in range(len(ls)):
            permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result, count, limit)
    else:
        count += 1  # only increments this call's local copy; the other branches never see it
        result.append(prefix)
    return result
So what I would like to do instead of including a count parameter in the function signature, is to notify some other part of my program every time a new permutation is found, and keep track of the count that way. This may be a good use of threading, but I would like to do it without parallelization if possible (or at least the simplest parallelized solution possible).
I realize that I would have to spawn a new thread at each call to permutations([*ls[0:i], *ls[i + 1:]], [*prefix, ls[i]], result) in the for loop.
I'm hoping that someone would be so kind as to point me in the right direction or let me know if there is a better way to do this in Python.
If you are not using threading, then I recommend not using threading and also not thinking in terms of using threading.
The reason is that the more simply and directly you are able to tackle a problem, the easier it is to think about.
As a second tip, any time you find yourself iterating through permutations, you probably should find a better approach. The reason is that the number of permutations of length n grows as n!, and depending on what you are doing and your patience, computers top out somewhere between n=10 and n=15. So finding ways to count without actually iterating becomes essential. How to do that, of course, depends on your problem.
But back to the problem as stated. I would personally solve this type of problem in Python using generators. That is, you have code that can produce the next element of the list in a generator, and then elsewhere you can have code that processes it. This allows you to start processing your list right away, and not keep it all in memory.
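For example, a rough sketch of that generator idea applied to the permutations code above (the function and variable names here are mine, and itertools.islice does the limiting, so no count parameter is needed):

from itertools import islice

def gen_permutations(ls, prefix=None):
    prefix = prefix or []
    if ls:
        for i in range(len(ls)):
            for p in gen_permutations(ls[:i] + ls[i + 1:], prefix + [ls[i]]):
                yield p
    else:
        yield prefix

limit = 10
found = list(islice(gen_permutations([1, 2, 3, 4]), limit))
print(len(found))   # at most `limit` permutations, produced lazily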
In a language without generators, I would tackle this with closures. That is, you pass in a function (or object) that you call for each value, and it does whatever it wants to do. That again allows you to separate the iteration logic from the logic of what you want to do with each iteration.
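Even in Python the same callback idea can be sketched, for comparison (again, the names are mine):

def for_each_permutation(ls, callback, prefix=None):
    prefix = prefix or []
    if ls:
        for i in range(len(ls)):
            for_each_permutation(ls[:i] + ls[i + 1:], callback, prefix + [ls[i]])
    else:
        callback(prefix)       # the caller decides what to do with each permutation

found = []
for_each_permutation([1, 2, 3], found.append)
print(len(found))              # 6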
If you're working with some other form of cooperative multi-tasking, use that instead. So, for example, in JavaScript you would have to figure out how to coordinate using Promises. (Luckily the async/await syntax lets you do that and make it look almost like a generator approach. Note that you may wind up with large parts of the data set in memory at once. How to avoid that is a topic in and of itself.) For another example, in Go you should use channels and goroutines.
I would only go to global variables as a last resort. And if you do, remember that you need enough memory to keep the entire data set that you iterated over in memory at once. This may be a lot of memory!
I prefer all of these over the usual multi-threading approach.
Data: I have two very long lists (up to 500M elements each) which are used extremely intensively (read-only, no modification) by the parallel part of my program. The first is a list of strings, the second is a list of lists of integers. The length of an element (string or list) in either list may vary from a few (most cases) up to a few hundred or thousand (rare cases) letters/integers.
In the parallel code (sub-processes) the lists are used as read-only data. When the parallel execution is finished, new elements may be added (appended to the end of the lists) by the main process.
Question: What is the best way to share such data between processes? Below are the solutions I am considering (i.e., those I am aware of). If I have missed something, please comment.
Possible solutions:
Array from the multiprocessing module. It was suggested that it can be combined with a numpy array. This solution seems the most reasonable, but it requires fixed-length elements, which is a real problem for me: I don't have enough RAM to set the array element size to the length of the longest item that may ever appear. Within this approach, I think the most effective way is to set the element length to some reasonable value (capable of storing most entries) and to store the remaining, longest entries in a second array with a big enough element length.
os.fork of the main process (with the lists described in the Data section). It is Unix-only, but that is not a problem for me. It should benefit from copy-on-write, but both the strings (elements of the 1st list) and the lists (elements of the 2nd list) are objects, and reference counting will probably cause the data to be copied each time anyway.
redis (or another DB which can store data in RAM). I really worry about performance.
store as an mmap object. This gives me a huge string-like (and file-like) object. Searching (iterating) over it will probably be the most time-consuming part of my program.
store in the original form in one process and use a queue/pipe to communicate with the other processes. Previously, when I used multiprocessing.Pool.imap with a small chunksize, communication between processes killed performance. Therefore I believe such a solution cannot be applied to really high traffic.
So, based on the above, it seems to me the most reasonable solution is to use Array and numpy (option 1) and try to split each list into a few arrays with different element lengths. Any comments or suggestions? Did I miss something?
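For what it is worth, a rough sketch of option 1 (fixed-width rows in a shared Array wrapped by a numpy view in every process; the sizes, the 'i' typecode, and the example values below are assumptions):

import numpy as np
from multiprocessing import Array, Process

N_ITEMS, MAX_LEN = 1000, 64    # assumed element count and fixed element length

def worker(shared, n_items, max_len):
    view = np.frombuffer(shared, dtype=np.int32).reshape(n_items, max_len)   # no copy
    print(view[0, :5])         # read-only access from the subprocess

if __name__ == '__main__':
    shared = Array('i', N_ITEMS * MAX_LEN, lock=False)   # read-only after filling, so no lock
    view = np.frombuffer(shared, dtype=np.int32).reshape(N_ITEMS, MAX_LEN)
    view[0, :3] = [7, 8, 9]    # main process fills the rows before starting workers
    p = Process(target=worker, args=(shared, N_ITEMS, MAX_LEN))
    p.start()
    p.join()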
I'm basically looking for some feedback from others that might have an opinion on this. The following is not exactly what I'm working on but the sample code does reproduce the issue.
I have a power set generator that returns all the permutations of a basic list I'm passing in. I need to sort the generated sets (in my real case the returned sets are tuples with a value that I want to sort by; the example below demonstrates the issue fine without it).
The issue is when I use sorted() on the power set generator, it blows memory usage up. I realize that 2^50 is a very large number, but without sorted memory usage is quite flat and so I'm wondering if there's a better way to sort a super large number of sets without running out of memory within a minute or two. This is running on Ubuntu with Python 2.6.5. (also required in this case)
def gen_powerset(seq):
    if len(seq) <= 1:
        yield seq
        yield []
    else:
        for i in gen_powerset(seq[1:]):
            yield [seq[0]]+i
            yield i

def main():
    initialSet = range(50)
    powerset = sorted(gen_powerset(initialSet))
    for i in powerset:
        print i

if __name__ == "__main__":
    main()
Disclaimer: If you try running this sample, please watch your memory utilization. Ctrl-C the sample if it nears 90% as your OS will start swapping memory to disk. If the sample is still running, your disk load will spike and really slow things down, making it hard to kill the sample in the first place.
Without sorted, you never need to store more than 1 or 2 values at a time; they get computed as they're needed because you're using generators (yield). Unfortunately, there is no good way to sort a list without knowing the whole thing (you can't yield a value from the sort until you've looked at all the items to make sure the one you have is the smallest).
Of course, if you have 2 sorted sublists you can merge them lazily, so you could build a sort based on merge sort that didn't store everything in memory at once, but it would be horribly inefficient in the general case.
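For what it's worth, the standard library's heapq.merge already does such a lazy merge of sorted inputs; a toy illustration with two small sorted generators:

import heapq

def evens():
    for i in range(0, 10, 2):
        yield i

def odds():
    for i in range(1, 10, 2):
        yield i

for value in heapq.merge(evens(), odds()):   # yields values on demand, never builds a full list
    print(value)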
The reason memory usage is higher with sorted is that it has to load all the items into memory at once. Since you wrote a generator, it only yields one element at a time, and the way you're using it only uses one value at a time, so Python doesn't need to keep all the items around at once. But you can't sort them without having all of them available.
You can't get around this as long as you're doing sorting, because the sort has to have all elements available.
The only way to get around the problem would be to rewrite your powerset generator to generate the items in the order you want. This may or may not be possible depending on exactly what order you want.
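For instance, if the order you want happens to be the default list-comparison (lexicographic) order that sorted() would produce here, a generator can already emit the subsets in that order, so no sorting pass is needed. A sketch, not tested against your real tuple case:

def gen_powerset_sorted(seq, start=0, prefix=None):
    prefix = prefix or []
    yield prefix
    for i in range(start, len(seq)):
        for subset in gen_powerset_sorted(seq, i + 1, prefix + [seq[i]]):
            yield subset

for subset in gen_powerset_sorted(range(3)):
    print(subset)   # [], [0], [0, 1], [0, 1, 2], [0, 2], [1], [1, 2], [2]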
You're using a generator, which only creates one value at a time before it is consumed; this is very memory efficient. The sorted function will need to convert that to a list so it all resides in memory at once. There's no way around it.
while True:
    Number = len(SomeList)
    OtherList = array([None]*Number)
    for i in xrange(Number):
        OtherList[i] = ...  # NumPy array calculation using only the i-th element of Array_1, Array_2, and Array_3
The 'Number' elements of OtherList and the other arrays can be calculated separately. However, as the program is time-dependent, we cannot proceed to the next step until all 'Number' elements have been processed.
Will multiprocessing be a good solution for this operation?
I need to speed up this process as much as possible. If there is a better approach, please suggest code.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginner's guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are generic guidelines, in no particular order:
measure the performance of your code. Find the hot spots. Your code might spend longer loading the input data than on all the calculations. Set your goal and define what trade-offs are acceptable
check with automated tests that you get expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity: an O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations to replace the explicit loops of the Python-only solution (see the sketch after this list)
rewrite the places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping the C extensions.
minimize allocation and data copying. Make it cache friendly.
explore whether multiple threads might be useful in your case, e.g., cython.parallel.prange(). Release the GIL.
Compare with a multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel.
Iterate
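As a tiny illustration of the vectorization guideline above (the arrays and the arithmetic here are made up):

import numpy as np

a = np.random.rand(10 ** 6)
b = np.random.rand(10 ** 6)

out_loop = np.empty_like(a)
for i in range(len(a)):          # explicit Python-level loop: slow
    out_loop[i] = 2.0 * a[i] + b[i]

out_vec = 2.0 * a + b            # vectorized equivalent, executed in compiled code

assert np.allclose(out_loop, out_vec)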
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and de-serializing the results is offset by the gains.
Best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass on any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed on via IPC (which is what multiprocessing does).
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.
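A rough sketch of both suggestions combined (the constant arrays, the number of workers, and the 'calculation' are placeholders):

import multiprocessing as mp

def worker(task_queue, result_queue, const_a, const_b):
    while True:
        i = task_queue.get()
        if i is None:                 # sentinel: no more work
            break
        result_queue.put((i, const_a[i] + const_b[i]))   # stand-in for the real calculation

if __name__ == '__main__':
    const_a, const_b = range(100), range(100)   # constant inputs, passed once at start time
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results, const_a, const_b))
             for _ in range(4)]
    for p in procs:
        p.start()
    for i in range(100):
        tasks.put(i)                  # add tasks as soon as they become available
    for _ in procs:
        tasks.put(None)
    out = [results.get() for _ in range(100)]
    for p in procs:
        p.join()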
I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.
In Python, SAS's approach might look something like:
with load('datastore') as data:
    for row in rows(data):
        row.logincome = log(row.income)
        row.rich = "Rich" if row.income > 100000 else "Poor"
This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:
data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", "Poor")
This will execute extremely quickly because log and ifelse are both compiled functions that operate on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra-backed datastore, I don't see how this approach works.
Question: Is there a way to keep the second API (like R/Numpy/Matlab) but delay computation? Perhaps by calling a sync(data) function at the end?
Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.
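For illustration only, one shape the delayed computation could take (every name here is hypothetical, not an existing API): record the derived columns as functions and evaluate them in a single pass when sync() is called.

import math

class LazyFrame(object):
    def __init__(self, rows):
        self._rows = rows            # iterable of dict-like rows
        self._pending = []           # (column_name, row -> value) pairs

    def define(self, name, fn):
        self._pending.append((name, fn))   # nothing is computed yet

    def sync(self):
        for row in self._rows:       # single pass over the data
            for name, fn in self._pending:
                row[name] = fn(row)
        self._pending = []

data = LazyFrame([{'income': 50000}, {'income': 150000}])
data.define('logincome', lambda r: math.log(r['income']))
data.define('rich', lambda r: 'Rich' if r['income'] > 100000 else 'Poor')
data.sync()                          # both derived columns computed in one loop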
I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore prevent any slowdown caused by looping over the data twice, without giving up the benefit of using optimized processing functions.
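A rough sketch of that chunked idea (the chunk size, column name, and fetch_chunk helper are assumptions, not a real API):

import numpy as np

CHUNK = 100000   # tune so a chunk fits comfortably in cache/memory

def process_chunks(fetch_chunk):
    # fetch_chunk(offset, size) returns a dict of numpy arrays, or None when exhausted
    offset = 0
    while True:
        cols = fetch_chunk(offset, CHUNK)
        if cols is None:
            break
        logincome = np.log(cols['income'])                        # vectorized work on one slice
        rich = np.where(cols['income'] > 100000, 'Rich', 'Poor')
        yield offset, logincome, rich
        offset += CHUNK

# toy usage with fabricated in-memory data standing in for the datastore
data = {'income': np.random.randint(1000, 200000, size=250000)}

def fetch_chunk(offset, size):
    if offset >= len(data['income']):
        return None
    return {'income': data['income'][offset:offset + size]}

for offset, logincome, rich in process_chunks(fetch_chunk):
    pass   # write the derived columns back to the datastore here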
I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.
For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:
for row in rows(data):
    # do stuff with row
Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But you would be using the send method of the generator. As an example for such a consumer, here is a sketch of riches:
def riches():
    rich_data = []
    while True:
        row = (yield)
        if row is None:
            break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data
The first yield (expression) is just there to feed the individual rows into riches. It does its thing, here building up a result array. After the while loop, the second yield (statement) is used to actually provide the result data to the caller.
Going back to the caller loop, it could look something like this:
richConsumer = riches()
richConsumer.next()  # advance to the first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers.send(row) here
data.rich = richConsumer.send(None)  # makes the consumer exit its inner loop; this final send returns the result data
I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH